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Abstract 


Optimizing  compilers  can  produce  very  efficient  code  but  incur  a  high  compilation  cost.  Because 
interactions  between  arbitrary  points  in  a  program  are  possible,  global  compiler  optimizations  are 
inherently  expensive  and  in  general  there  is  no  easy  way  to  partition  the  data  processed  during  global 
compiler  optimizations  into  ^dependent  components. 

This  thesis  explores  how  program  structure  can  be  used  to  provide  a  basis  for  the  parallelization  for  global 
compiler  optimizations.  Two  concepts  to  obtain  data  parallelism  in  global  compiler  optimizations  are 
described.  The  first  concept  uses  program  structure  explicitly  for  the  parallelization,  demonstrated  by  the 
parallelization  of  interval  analysis  of  global  data  flow  equations.  The  second  concept  consists  of  using 
program  structure  analytically  to  establish  data  partitioning  points.  The  effectiveness  of  this  concept  is 
demonstrated  in  the  parallelization  of  global  register  allocation  via  optimal  coloring  of  a  graph  denoting 
register  conflicts,  an  NP  complete  problem.  A  model  for  global  register  allocation  in  which  program 
structure  is  used  to  analyze  a  register  conflict  graph  is  presented.  The  purpose  of  this  analysis  is  to  detect 
clique  separators  that  partition  a  conflict  graph  into  independent  components  that  can  be  colored 
independently  and  combined  to  an  overall  coloring  by  renaming.  Properties  of  live  ranges  in  loops  and 
conditionals  are  linked  to  characteristics  of  the  conflict  graph.  If  certain  restrictions  are  met  by  the  live 
ranges  that  occur  in  conditionals  and  loops,  the  register  conflict  graph  can  be  transformed  to  an  equivalent 
interval  graph.  Interval  register  conflict  graphs  are  desirable  because  they  can  be  colored  optimally  in 
polynomial  time  and  because  all  clique  separators  of  an  interval  graph  can  be  located  systematically.  The 
experimental  evaluation  of  my  method  shows  that  in  many  cases  the  entire  conflict  graph  or  large  portions 
thereof  can  be  mapped  to  equivalent  interval  graphs.  Consequently,  in  such  conflict  graphs  almost  all 
clique  separators  can  be  detected  which  makes  it  easy  to  partition  the  conflict  graph  into  components  that 
can  be  processed  independently.  The  knowledge  about  interval  portions  of  register  conflict  graph  can  be 
used  both  as  a  platform  for  the  parallelization  of  global  register  allocation  and  to  improve  sequential 
register  coloring  algorithms. 
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Chapter  1 
Introduction 


Optimizing  compilers  can  produa  ~ry  efficient  code  but  inctx*  a  high  compilation  cost  To  hide  the 
details  of  complicated  architectures  ui.'  us<u  ar.L  the  same  time  lake  advantage  of  a  machine’s 
processing  power,  compilers  for  such  machines  must  optimize  extensively.  Compiler  optimizations  consist 
of  machine  independent  optimizations  that  are  useful  for  every  backend  and  machine  dependent 
optimizations  tailored  to  specific  architectures.  Global  compiler  optimizations  are  inherently  expensive 
because  they  potentially  depend  on  the  flow  of  data  in  any  part  of  a  program.  The  name  suggests  that 
optimizing  compilers  produce  optimal  code,  and  given  a  specific  input  program  and  a  specific  machine, 
theoretically  it  is  possible  to  produce  optimal  code.  In  practice  however,  optimal  code  can  hardly  be 
achieved.  The  reason  is  that  a  number  of  individual  global  compiler  optimizations  are  NP  complete,  so 
finding  an  optimal  solution  for  one  such  optimization  can  require  exponential  compilation  time.  In 
addition,  compiler  optimizations  interact  with  each  other  and  therefore  optimization  times  are  non  trivial 
even  when  the  resulting  code  is  not  optimal.  For  some  optimizing  compilers,  compilation  times  measured 
in  hours  are  not  unusual  [GrossZobel  89],  This  is  not  surprising  when  one  considers  the  hard  problems  that 
must  be  solved  for  example  by  a  compiler  for  pipelined  machines  [Lam  88],  machines  with  complicated 
memory  organizations  or  vectorized  machines  [Sites  78a,  Sites  78b,  KuckKuhnEtAl  81].  The  trend 
towards  high  compilation  times  is  even  more  evident  in  compilers  for  MIMD  machines:  compilers  for 
MIMD  machines  must  generate  efficient  code  for  a  number  of  (potentially  different)  processors. 
Ultimately,  compilers  will  be  one  of  the  major  bottlenecks  in  the  software  development  cycle. 

1.1.  Motivation 

The  focus  of  this  thesis  is  to  investigate  parallel  frameworks  for  global  compiler  optimizations.  We  chose 
to  investigate  the  parallelization  of  global  compiler  optimizations  for  two  reasons.  The  initial  motivation 
was  th<*  need  to  speedup  compilation.  In  most  optimizing  compilers,  global  optimizations  and  code 
generation  account  for  most  of  the  compilation  time,  so  io  speedup  the  compilation  process  most 
effectively,  the  parallelization  of  global  optimizations  and  code  generation  must  be  considered.  Unlike 
code  generation,  most  global  optimizations  are  independent  of  the  target  machine,  so  their  parallelization 
can  be  incorporated  into  any  optimizing  compiler. 

The  second  motivation  was  to  study  the  interactions  between  program  structure  and  the  characteristics  of 
global  optimization  problems.  In  a  mathematical  sense,  many  global  optimization  problems  are  NP 
complete.  One  reason  why  some  global  optimization  problems  are  so  hard  to  solve  is  that  the  data  for  such 
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optim  zations  is  processed  independently  of  the  input  program.  One  example  of  such  an  optimization  is 
global  register  allocation:  a  widely  used  approach  to  global  register  allocation  is  based  on  graph  coloring. 
For  ?  given  input  program,  a  graph  denoting  register  conflicts  is  constructed  and  an  optimal  assignment  of 
registers  to  live  ranges  in  the  program  is  equivalent  to  an  optimal  coloring  of  this  graph.  Finding  a  parallel 
framework  in  which  it  is  possible  to  partition  this  graph  into  independent  components  that  can  be  colored 
individually  is  hard  when  only  the  graph  is  considered.  In  this  thesis,  we  examine  whether  knowledge 
about  the  structure  of  the  L'  ^ut  program  can  be  used  to  detect  locality  in  the  data  processed  in  global 
compiler  optimizations  that  can  be  used  to  detect  data  parallelism. 

12.  Paradigms  for  parallelism 

The  parallelization  of  an  algorithm  is  useful  only  if  certain  conditions  are  met  First  given  the  solution  i  f 
sequential  algorithm  for  a  problem,  the  solution  of  the  parallel  algorithm  for  the  same  problem  must  be  at 
least  as  good  as  the  sequential  solution.  In  other  words,  for  the  purpose  of  this  thesis  we  do  not  areept 
decrease  of  the  quality  of  the  parallel  solution  in  return  for  parallel  speedup.  Second,  it  must  be  ensured 
that  partial  results  can  be  combined  to  a  global  result  without  compromising  the  quality  of  the  global  result 
If  for  instance  the  solutions  for  small  components  are  optimal,  we  require  that  the  combination  of  the 
individual  optimal  solutions  results  in  an  overall  optimal  solution.  Third,  the  re-combinatiou  of  partial 
results  to  an  overall  result  must  be  fast  enough  such  that  parallel  speedup  is  not  offset  by  the  post 
processing  time  required  for  re-combination. 


1J.  Benefits  of  data  parallelism 

The  research  for  the  parallelization  of  a  given  problem  consists  of  developing  an  algorithm  that  partitions 
the  data  into  components  that  can  be  processed  independently.  The  parallelization  of  a  given  problem 
requires  the  careful  analysis  of  the  problem;  in  many  cases  this  leads  to  a  better  understanding  ot  the 
problem  itself  which  can  lead  to  an  improvement  of  sequential  algorithms  as  well.  Partitioning  the  data 
into  independent  components  has  advantages  for  both  parallel  and  sequential  implementations,  because 
smaller  subcomponents  of  the  input  are  processed  independently. 

Divide  and  conquer: 

Given  polynomial  running  time  for  a  sequential  algorithm,  partitioning  the  data  into  independent  smaller 
subcomponents  allows  the  reduction  of  the  overall  running  time  even  when  the  algorithm  runs  sequentially. 
Because  polynomial  running  time  (at  least  0(r t2)  where  n  is  the  size  of  the  input)  grows  non-lineariy  with 
increasing  problem  size,  solving  smaller  problems  is  cheaper  than  solving  larger  problems.  This  advantage 
becomes  even  more  apparent  when  NP  complete  problems  are  considered:  dividing  a  problem  into  smaller 
components  has  the  result  that  without  increasing  the  overall  running  time,  more  effort  can  be  spent  to  find 
a  good  solution  for  the  individual  components;  this  means  that  expensive  heuristics  or  even  exhaustive 
search  can  be  applied  if  the  subproblems  can  be  made  small  enough. 
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Parallel  speedup: 

Given  the  points  at  which  the  data  can  be  partitioned,  the  parallel  speedup  depends  on  several  factors: 

1.  An  upper  bound  for  the  theoretical  parallel  speedup:  The  maximal  number  of  processors  that 
can  operate  concurrently  for  a  given  problem  is  an  upper  bound  for  the  theoretical  para  !cl 
speedup. 

2.  Upper  bounds  for  the  observed  parallel  speedup:  For  a  given  parallelization,  the  u.tal 
sequential  processing  time  divided  by  the  maximal  processing  time  of  the  parallel  tasks  is  an 
upper  bound  for  the  maximal  parallel  speedup  that  can  be  observed.  Both  the  amount  of 
communication  between  concurrent  processes,  the  time  to  setup  the  parallel  execution  and  the 
postprocessing  time  limit  the  actual  observed  parallel  speedup.  Systems  overhead  such  as  bus 
contention,  scheduling  overhead,  parallel  startup  time  etc.  are  problem  independent  factors 
that  limit  the  observed  parallel  speedup. 

A  successful  parallelization  of  an  algorithm  therefore  consists  of  two  parts,  the  data  partitioning  of  the  input 
and  the  parallelization  itself. 


1.4.  Data  parallelism  based  on  program  structure 

Global  compiler  optimizations  are  inherently  hard  to  parallelize  because  potential  interactions  between 
any  two  parts  of  the  input  program  are  possible.  Due  to  these  interactions,  the  data  that  must  be  processed 
in  global  optimizations  is  highly  interconnected  and  there  is  no  simple  way  to  detect  data  partitioning 
points.  The  research  focus  of  this  thesis  is  to  establish  the  role  of  program  structure  for  the  partitioning  of 
data  that  is  processed  ir<  global  compiler  optimizations.  We  are  more  interested  in  developing  frameworks 
in  which  data  partitionings  can  be  found  in  a  methodic  way,  and  the  thesis  claim  is  that  this  method  to  find 
data  parallelism  is  based  on  the  structure  of  the  input  program. 


1.4.1.  Two  types  of  global  compiler  optimizations 

We  distinguish  between  two  types  of  global  compiler  optimizations:  structured  and  unstructured 
compiler  optimizations.  In  structured  global  compiler  optimizations  the  basic  blocks  of  a  program  are 
processed  in  a  particular  order  that  is  based  on  the  parse  tree  of  the  input  program.  Examples  of  structured 
compiler  optimizations  are  optimizations  that  operate  based  on  a  program’s  loop  structure  such  as 
vectorization  and  software  pipelining  and  all  data  flow  problems.  Data  flow  problems  can  be  solved  in 
polynomial  time,  and  there  exist  several  algorithms  that  take  advantage  of  locality  of  the  data  in  loops:  the 
partial  results  for  an  inner  loop  need  not  be  recomputed  for  the  data  flow  analysis  of  the  enclosing  loop. 
Note  that  in  this  type  of  optimization  algorithms,  backtracking  across  loop  boundaries  is  usually  not 
applied. 

The  difference  between  unstructured  and  structured  optimizations  is  that  in  unstructured  optimizations  the 
order  in  which  the  data  of  the  input  program  is  processed  is  independent  of  the  parse  tree.  In  general  data 
partitioning  based  on  the  parse  tree  of  the  input  program  does  not  yield  independent  components.  In  global 
register  allocation,  an  example  of  an  unstructured  global  compiler  optimization,  it  is  generally  not  the  case 
that  the  partial  solution  for  an  inner  loop  can  be  incorporated  into  the  solution  of  the  enclosing  loop  without 
re-computation.  The  lack  of  easily  detectable  data  locality  greatly  influences  possible  parallelizations  of 
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such  a  problem.  The  challenge  for  such  optimization  is  to  find  a  model  that  permits  us  to  incorporate 
enough  knowledge  about  the  specific  problem  instance  such  that  data  partitioning  points  can  be  found, 
though  usually  not  at  locations  that  coincide  with  loop  or  conditional  entries  and  exits. 

13.  Two  representative  optimizations 

To  assess  how  much  parallel  frameworks  for  global  compiler  optimizations  depend  on  program  structure, 
we  develop  parallel  frameworks  for  both  a  structured  and  an  unstructured  global  compiler  optimization. 
We  concentrate  on  interval  analysis  of  g'obal  data  flow  problems  and  on  global  register  allocation  as 
representatives  for  both  classes.  This  choice  was  guided  by  the  generality  of  both  optimizations;  both  are 
machine  independent  and  important  optimizations  used  in  a  number  of  optimizing  compilers. 

13.1.  Parallel  interval  analysis  of  data  flow  equations 

Interval  analysis  provides  a  structured  framework  for  all  global  data  flow  problems.  The  purpose  of 
global  data  flow  analysis  is  to  collect  information  about  the  flow  of  data  for  every  basic  block  in  a  program. 
Global  data  flow  analysis  is  prerequisite  for  all  global  compiler  optimizations.  Some  optimizations  like  for 
instance  dead  code  removal  or  common  subexpression  elimination  change  the  data  flow  information  and 
therefore  global  data  flow  analysis  must  be  carried  out  several  times  during  global  optimizations.  A 
parallelization  of  interval  analysis  can  be  expanded  to  a  paiallelization  of  all  flow  problems. 

Data  parallelism  based  on  explicit  program  structure 

In  global  data  flow  analysis,  locality  of  data  is  given  explicitly  by  the  loop  structure  of  a  program. 
Interval  analysis  is  a  method  in  which  the  data  flow  information  is  gathered  on  a  per-loop  basis,  starting 
with  the  innermost  loops.  The  data  flow  information  for  an  inner  loop  need  not  be  re-computed  for  the  data 
flow  information  of  the  enclosing  loop  so  in  other  words,  interval  analysis  does  not  require  backtracking. 
The  loop  structure  of  the  program  is  the  basis  for  the  data  partitioning  for  our  parallelization  of  interval 
analysis.  Because  data  locality  is  given  explicitly  by  the  loop  structure,  our  parallelization  of  interval 
analysis  is  straightforward.  We  evaluate  our  approach  by  a  parallel  implementation  and  present  our 
measurements  of  the  parallel  speedup  over  a  standard  sequential  implementation. 

13 2.  Data  partitioning  in  global  register  allocation 

Global  register  allocation  is  the  problem  of  mapping  variables  and  temporary  variables  created  by 
compiler  optimizations  to  registers  efficiently.  Because  access  times  to  registers  are  considerably  lower 
than  access  times  to  memory,  it  is  desirable  to  keep  as  many  variables  as  possible  in  registers  during 
program  execution.  Registers  are  a  scarce  resource  on  every  computer  architecture,  therefore  the  number 
of  registers  is  usually  no  sufficient  to  hold  all  variables  of  an  input  program.  Optimal  global  register 
alloction  is  mathematically  equivalent  to  finding  an  optimal  coloring  for  an  arbitrary  graph  and  therefore 
NP  complete,  but  there  are  certain  types  of  graphs  for  which  am  optimal  coloring  can  be  found  in 
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polynomial  time.  Interval  graphs  can  be  colored  optimally  in  linear  time  and  are  particularly  important  in 
the  context  of  parallel  global  register  allocation  via  graph  coloring  for  twc  reasons.  First,  many  register 
conflict  graphs  contain  portions  that  are  interval  graphs,  and  second,  it  is  easy  to  partition  interval  graphs 
into  clique  connected  components  that  can  be  colored  independently.  An  overall  coloring  of  clique 
connected  components  can  be  obtained  by  renaming  only;  re-computing  partial  results  is  not  necessary 
[Tarjan  85]. 

While  for  interval  analysis  data  locality  is  given  in  a  natural  way  by  the  loop  structure  of  the  input 
program,  finding  a  data  partitioning  for  global  register  allocation  is  not  straightforward.  The  focus  of  the 
thesis  research  in  parallel  global  register  is  the  development  of  a  structured  model  for  data  partitioning 
based  on  clique  separators  in  the  register  conflict  graph.  Knowledge  about  program  structure  is  used  to 
detect  interval  portions  of  the  register  conflict  graph  in  which  clique  separators  can  be  found 
systematically.  Non-imerval  portions  of  the  register  conflict  graph  are  manipulated  such  that  the  derived 
graph  is  an  interval  "raph.  Once  ihe  interval  portions  of  a  register  conflict  graph  are  known,  partitioning 
the  conflict  graph  via  clique  separators  and  the  parallelization  are  straightforward. 

The  evaluation  of  our  method  to  detect  clique  separators  in  register  conflict  graphs  focusses  on 
determining  whether  the  model  is  powerful  enough  to  detect  enough  clique  separators  in  the  register 
conflict  graphs  of  real  programs. 

1.6.  Approaches  to  parallel  compilation 

The  coarsest  level  of  parallelism  in  compilation  is  parallel  system  building.  Prerequisite  for  parallel 
system  building  is  separate  compilation.  A  system  that  consists  of  several  modules  is  composed  to  a  runfile 
at  link  time  -  each  moule  can  be  compiled  in  parallel  before  linking.  Optimistic  make  [BubZwaen  92],  a 
system  that  distributes  the  compilation  of  individual  modules  applies  this  level  of  parallelism 

The  compiler  used  in  parallel  system  building  is  still  sequential  -  the  next  level  of  parallelism  in 
compilation  is  to  turn  a  sequential  compiler  into  a  parallel  program.  The  coarsest  units  of  parallelism  are 
procedures  and  functions.  If  procedures  and  functions  are  to  be  compiled  independently,  inter  procedural 
optimizations  have  to  be  curtailed.  This  type  of  parallelization  has  been  studied  before  [Frankel 
83,  GrossZobel  89];  it  requires  the  compiler  driver  to  dispatch  the  (sequential)  compilation  of  each  function 
or  procedure.  This  approach  is  very  successful  when  inter  procedural  optimizations  are  curtailed  because 
then  there  are  no  dependencies  between  individual  parallel  tasks  Note  that  the  core  of  the  compiler  still 
runs  sequentially. 

The  next  level  of  parallelism  is  achieved  by  compiling  units  that  are  smaller  than  procedures  in  parallel. 
Many  research  efforts  focus  on  parallel  parsing  [Boehm  87,  Klein  90,  Fischer  75],  the  compilation  phase 
that  is  formally  best  understood.  Both  the  parallel  compiler  developed  at  the  University  of  Toronto 
[Seshadri  et  al  88]  and  the  parallel  compiler  described  in  [Vandevoorde  88]  are  examples  of  parallel  one 
pass  compilers.  The  input  program  is  divided  into  components  during  parsing,  and  each  component  is 
processed  independently  through  code  generation.  This  method  works  well  in  practice  for  non-optimizing 
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compilers  or  compilers  that  apply  local  optimizations  only,  i  or  global  optimizations,  interactions  between 
any  parts  of  a  program  must  be  considered.  Partitioning  the  input  program  at  parsing  time  makes  it  hard  to 
predict  the  communication  traffic  between  the  components  during  the  global  optimization  phase,  so 
potential  parallel  speedup  is  offset  by  interactions  between  the  components  during  global  optimizations. 

To  dale,  the  parallelization  of  global  optimizations  has  received  little  attention.  Some  efforts  have 
conentrated  on  the  parallelization  of  global  data  flow  analysis  [LeeMarloweRyder  91,GuptaPollockSoffa 
90],  and  approaches  to  perform  global  register  allocation  hierarchically  or  incrementally  are  indirectly 
related  to  detecting  dam  parallelism  in  the  problem  [GupSofSte  89,  CallahanKoblenz  91]. 

What  distinguishes  this  thesis  from  other  approaches  to  the  parallelization  of  global  compiler 
optimizations  is  the  development  of  models  that  relate  program  structure  to  data  parallelism.  Main 
conHv  mon  of  this  thesis  is  the  development  of  a  structured  model  for  detecting  data  parallelism  in  global 
register  allocation.  The  novelty  of  the  approach  taken  in  the  thesis  is  that  program  structure  is  used 
systematically  rather  than  heuristically.  The  analysis  of  boundary  conditions  between  components  that  are 
processed  independently  ensures  that  partial  results  are  combined  without  compromising  optimality  of  the 
overall  result 

1.6.1.  Organization  of  the  thesis 

The  remainder  of  this  thesis  is  organized  as  follows.  Our  model  of  compilation  and  jasic  definitions  are 
given  in  Chapter  2.  The  parallelization  of  interval  analysis  and  the  evaluation  of  a  parallel  implementation 
are  presented  in  Chapter  3.  The  bulk  of  the  thesis  is  devoted  to  the  model  for  structured  global  register 
allocation.  Terminology  and  basic  definitions  of  the  model  are  given  in  Chapter  4.  In  Chapter  5  we 
demonstrate  techniques  that  allow  us  to  detect  which  portions  of  a  register  conflict  graph  are  interval 
graphs.  Manipulations  that  can  change  a  non-interval  register  conflict  graph  into  an  interval  register 
conflict  graph  are  introduced  in  Chapter  6.  The  evaluation  of  the  model  for  structured  global  register 
allocation  is  given  in  Chapter  7,  and  we  conclude  by  discussing  the  contributions  in  Chapter  8. 
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Chapter  2 
Background 


Goal  of  this  thesis  is  to  investigate  methods  to  parallelize  global  compiler  optimizations.  We  restrict  our 
input  programs  to  well  structured  programs  built  from  a  set  of  compound  programming  constructs.  In  this 
chapter  we  introduce  our  model  of  compilation,  and  that  give  definitions  of  compound  programming 
constructs  and  well  structured  programs.  The  definitions  given  in  this  chapter  will  be  used  throughout  the 
remainder  of  the  thesis. 


2.1.  A  model  for  compilation 

A  compiler  consists  of  one  or  several  phases,  some  of  which  are  formally  better  understood  than  others. 
Before  going  into  detail  about  the  problems  of  parallel  optimization,  we  present  an  "abstract"  model  of 
compilation.  This  abstract  model  contains  all  important  compilation  phases  and  illustrates  the 
dependencies  between  the  phases.  A  compiler  implementation  consists  of  a  combination  of  the  phases  in 
the  abstract  model.  An  enumeration  of  the  phases  in  our  compilation  model  is  listed  below. 

Parsing  and  semantic  checking 

Derive  the  syntax  tree  from  the  input  and  check  for  syntactic  and  semantic  errors  in  the 
program. 

Intermediate  code  generation 

Derive  the  intermediate  language  representation  of  the  program  from  the  syntax  tree. 
Partition  the  program  into  basic  blocks  and  construct  the  program  flow  graph. 

Local  data  flow  analysis 

Perform  data  flow  analysis  within  the  basic  blocks  of  the  program.  local  data  flow 
analysis  is  necessary  to  perform  local  optimizations. 

Local  optimizations  Perform  optimizations  within  basic  blocks. 

Global  data  flow  analysis 

Compute  infon  tr.tion  about  the  flow  of  data  in  the  entire  program,  i.e.  across  basic 
block  boundaries. 

Global  optimizations 

Optimize  the  code  across  basic  block  sider  the  flow  of  control  of  the  entire 

program.  Global  optimizations  inclu  i  dural  optimizations. 

Code  generation  Generate  code  from  the  (possibly  opt  ^mediate  language  representation  and 

convert  the  intermediate  language  representation  into  assembly  code. 

Assembly  Produce  machine  code  from  assembly  code;  perform  peephole  optimizations. 

Figure  2-1  gives  a  view  of  the  compilation  model.  In  this  figure,  the  boxes  represent  compilation  phases, 
and  arcs  between  the  boxes  depict  data  dependencies  between  the  phases.  Every  compilation  starts  with 
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Figure  2-1:  The  compilation  model 

parsing  ,tnd  intermediate  language  generation.  Non-optimizing  compilers  enter  code  generation  directly 
after  parsing  and  semantic  checking.  Data  flow  analysis  and  optimization  occur  after  intermediate  code 
generation  and  before  code  generation.  Optimizing  compilers  can  be  classified  in  one  pass  compilers  and 
multiple  pass  compilers.  One  pass  compilers  perform  each  stage  of  the  compilation  exactly  once  while 
multiple  pass  compilers  loop  through  the  optimization  phases  several  times,  each  time  performing  more 
optimizations. 

The  shaded  are  of  Figure  2-1  depicts  the  parts  of  the  compilation  process  that  are  the  research  focus  of  the 
thesis. 
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2.2.  Input  model 

In  out  compilation  model,  global  optimizations  succeed  the  intermediate  code  generation  phase,  so  the 
input  program  has  been  divided  into  basic  blocks  and  the  program  flow  graph  has  been  constructed.  In 
other  words,  the  input  to  the  global  optimization  phase  are  flow  graphs  constructed  from  the  input  program. 
Throughout  this  thesis,  the  smallest  unit  of  parallelism  is  a  basic  block,  defined  as  follows: 

Definition  1:  (Basic  block)  A  basic  block  consists  of  a  nonempty  set  of  instructions  iv-  JK  such 
that  no  instruction  i-  e  [i j  }  is  a  jump  instructs  m.  The  last  instuction  in  can  be  but  need  not 
be  a  jump  instruction.  The  first  instruction  ij,  can  but  need  not  be  the  target  of  a  jump  instruction. 

Note  that  our  definition  of  basic  blocks  differs  from  the  most  commonly  used  definition  of  basic  blocks 
where  the  first  instruction  is  always  the  target  of  a  jump  instruction  and  the  last  instruction  is  always  a  jump 
instruction. 

Before  we  give  our  definition  of  a  well  structured  flow  grap^  we  define  individual  programming 
constructs  that  occur  in  well  structured  flow  graphs. 

Note  that  every  basic  block  consists  of  a  straight  line  code  sequence,  and  that  basic  blocks  can  consist  of 
as  few  as  one  instruction.  Because  we  do  not  require  that  the  last  instruction  of  a  basic  block  be  a  jump 
instruction,  we  define  straight  line  code  in  terms  of  basic  block  sequences. 

Definition  2:  (Straight  line  code)  Straight  line  code  is  a  directed  graph  consisting  of  basic 
blocks  b1,...J>n  such  that  all  6-e  (bl  are  connected  and  each  £>-e  {bv...Jbn}  has  exactly  one 
incoming  and  one  outgoing  edge. 

Definition  3:  (Loop)  A  loop  is  a  directed  graph  consisting  of  a  loop  head  h,  a  loop  exit  e  and  a 
loop  body.  The  loop  body  can  consist  of  straight  line  code,  a  conditional  branch  or  a  loop.  A 
cycle  in  the  loop  contains  h  iff  it  contains  e. 

Figure  2-2  shows  some  examples  of  loops.  The  leftmost  construct  is  a  nested  loop,  the  loop  in  the  middle 
consists  of  a  single  basic  block,  and  the  rightmost  loop  consists  of  a  conditional  statement 


Figure  2-2:  Examples  of  loop  constructs 

Definition  4:  (Conditional  branch)  A  conditional  branch  is  a  directed  graph  consisting  of  a 
split  node,  a  join  node  and  a  set  of  branch  clauses.  A  branch  clause  consists  of  either  straight  line 
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code,  a  loop  or  a  branch.  No  computations  are  performed  in  either  the  split  or  the  join  node. 
Each  branch  clause  contains  a  node  s  such  that  there  is  an  edge  from  the  split  node  to  s.  Each 
branch  clause  contains  a  node  e  such  that  there  is  an  edge  from  e  to  the  join  node. 

The  split  node  and  the  join  node  are  "imaginatory"  basic  blocks  in  which  no  computation  takes  place. 
Given  a  node  n  in  a  flow  graph  that  has  more  than  one  outgoing  edges  that  are  not  backedges,  we  insert  a 
new  node  s  and  an  edge  from  n  to  r;  all  forward  edges  originating  at  n  originate  from  the  newly  introduced 
node  s.  Given  a  node  m  in  a  flow  graph  that  has  more  than  one  incoming  edges  that  are  not  backedges,  we 
insert  the  join  node  j,  re-direct  all  forward  edges  that  go  into  m  into  j  and  insert  an  edge  from  j  to  m. 
Because  no  computations  are  performed  in  s  and  j,  the  data  flow  information  does  not  change  for  any  basic 
block  in  the  original  flow  graph.  We  will  use  the  split  and  join  node  to  distinguish  between  live  ranges  of 
variables  that  occur  in  conditionals  in  our  model  for  global  register  allocation. 

Figure  2-3  shows  a  conditional  branch  with  split  node  SO',  join  node  B8'  and  two  branch  clauses.  The 
first  branch  clause  consists  of  a  conditional  branch  consisting  of  Bl,  B2,  B3  and  B4  with  inserted  split  and 
join  node  Bl’  and  B4‘ ,  the  second  branch  clause  consists  of  a  loop,  formed  by  B5,  B6  and  B7.  The  split  and 
join  nodes  have  been  inserted  into  the  original  flow  graph;  this  is  indicated  by  representing  them  as  circles. 


Figure  2-3:  A  conditional  branch 


Note  that  our  definition  of  conditional  branches  requires  one  unique  split  node  S  and  one  unique  join  node 
J  per  conditional  The  pair  <SJ>  uniquely  identifies  a  conditional.  Note  that  the  number  of  branch  clauses 
is  unlimited.  Thus,  a  case  statement  is  just  an  example  of  a  conditional  branch. 

We  now  give  the  formal  definition  of  a  well  structured  flow  graph. 
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Definition  5:  ( Well  structured  flow  graph)  A  well  structured  flow  graph  is  a  directed  graph 
consisting  of  basic  blocks  and  a  set  of  edges,  such  that 

1.  There  exists  one  unique  basic  block  called  the  program  entry  which  dominates  all  other 
basic  blocks  in  the  flow  graph. 

2.  There  exists  one  unique  basic  block  called  the  program  exit  which  has  no  ou'roing 
edges 

3.  The  graph  derived  from  the  flow  graph  by  removing  the  program  entry  and  the  program 
exit  and  all  edges  incident  to  those  nodes  consists  of  a  sequence  of  pieces  of  straight  line 
code,  loops  and  conditional  branches. 


23.  Chapter  summary 

We  have  introduced  our  model  of  compilation  and  defined  our  input  model  for  global  optimization.  We 
restrict  our  input  to  well  structured  flow  graphs  that  consist  of  straight  line  code,  loops  and  conditionals. 
Throughout  the  thesis,  we  refer  to  the  definitions  given  in  this  chapter  whenever  one  of  those  terms  is  used. 
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Chapter  3 

Parallel  Interval  Analysis 


In  this  chapter,  we  describe  the  development  of  a  parallel  framework  for  global  data  flow  analysis. 
Global  data  flow  analysi ,  is  a  prerequisite  for  all  global  compiler  optimizations.  The  purpose  of  global  data 
flow  analysis  is  to  colier;  information  about  the  flow  of  data  for  every  basic  block  in  a  program  flow  graph. 
Given  a  program  flow  gn  jh  with  n  basic  blocks,  the  time  complexity  of  solving  a  global  data  flow  problem 
is  0(n2).  From  a  theoretical  standpoint  this  is  small;  the  reason  why  global  data  flow  analysis  can  account 
for  a  considerable  portion  of  the  overall  compilation  time  is  that  global  data  flow  problems  must  be  solved 
many  times  in  the  course  of  global  optimization. 

Interval  analysis  is  a  framework  for  global  data  flow  analysis  in  which  the  basic  blocks  of  a  program  are 
processed  in  a  particular  order.  The  reason  why  interval  analysis  is  a  good  candidate  for  parallelization  is 
that  the  data  processed  during  interval  analysis  can  be  partitioned  into  independent  pieces  in  a  natural  way. 
The  data  flow  information  for  non-nested  loops  can  be  computed  independently  and  embedded  into  an 
overall  result  without  backtracking.  This  leads  to  a  straightforward  parallelization  of  interval  analysis:  the 
loop  structure  of  the  program  dictates  the  partitioning  of  the  data  for  the  parallelization.  A  number  of 
global  optimizations  that  operate  on  loops  of  a  program  fit  into  the  same  parallel  framework;  one  example 
is  loop  vectorization.  To  assess  the  effectiveness  of  this  simple  parallelization  we  implemented  parallel 
interval  analysis  for  the  solution  of  a  data  flow  equation  and  measured  the  parallel  speedup  for  a  set  of 
benchmark  functions. 

After  a  brief  introduction  of  interval  analysis,  we  describe  our  parallel  framework  for  interval  analysis 
and  its  implementation.  We  discuss  approximations  to  the  optimal  parallel  speedup  and  conclude  the 
chapter  with  the  description  of  our  implementation  and  the  presentation  of  measurements  of  the  observed 
speedup  for  a  benchmark  of  functions. 

3.1.  Introduction  to  global  data  flow  analysis 

The  purpose  of  global  data  flow  analysis  is  to  compute  global  information  about  variables  and 
expressions  in  a  program.  Global  data  flow  problems  are  formally  well  understood,  and  there  exist  a 
number  of  different  frameworks  for  the  computation  of  data  flow  information.  The  most  common 
techniques  for  global  data  flow  analysis  are  discussed  in  [Kennedy  81].  In  this  section  we  give  a  brief 
introduction  to  data  flow  analysis  and  interval  analysis.  We  illustrate  our  introduction  with  an  example  of  a 
data  flow  problem. 


15 


Data  flow  analysis  problems  can  be  divided  into  two  classes: 


Class  1  Given  a  basic  block  b  in  the  flow  graph,  what  can  happen  before  control  reaches  b,  that 

is,  what  definitions  can  affect  computations  at  b. 

Class  2  Given  a  basic  block  b  in  the  program,  what  can  happen  after  control  leaves  b,  that  is, 

what  uses  of  expressions  can  be  affected  by  computations  at  b. 


Class  1  problems  are  known  as  forward  flow  problems,  class  2  problems  are  called  backward  flow 
problems.  An  example  of  a  class  1  problem  is  availability  analysis,  a  typical  class  2  problem  is  liveness 
analysis. 

Global  data  flow  problems  can  be  formulated  as  a  set  of  data  flow  equations.  The  core  of  global  data 
flow  analysis  is  to  find  solutions  for  those  equations. 

The  simplest  approach  to  data  flow  analysis  is  to  iterate  through  the  nodes  of  the  flow  graph  applying  the 
appropriate  equations  until  no  changes  take  place.  If  the  number  of  nodes  in  the  flow  graph  is  n,  the 
iterative  algorithm  requires  0{r?)  steps  for  the  entire  computation  [Aho  84]. 

Interval  analysis  is  a  method  to  solve  data  flow  equations  that  takes  advantage  of  the  locality  of  the  flow 
information  in  loops.  Interval  analysis  consists  of  two  passes.  In  Pass  1  of  interval  analysis,  local 
information  for  each  basic  block  is  determined  in  a  form  suitable  for  solving  the  equations.  Once  the  local 
properties  of  all  basic  blocks  are  known,  the  second  pass  determines  the  interactions  with  other  basic 
blocks.  After  Pass  2  is  finished,  we  know  how  each  basic  block  is  affected  by  the  instructions  of  every 
other  basic  block. 

3 2.  Interval  analysis:  background 

Interval  analysis  operates  on  regions  or  intervals  of  a  flow  graph  in  a  specific  order.  Intervals  capture 
loops  in  a  flow  graph,  more  formally. 

Definition  1:  ( Interval)  Given  a  flow  graph  G  with  basic  blocks  B,  an  interval  in  G  is  defined  to 
be  a  set  of  basic  blocks  I<~B  with  the  following  properties: 

•  There  is  a  node  he  I,  called  the  head  of  I,  which  is  contained  in  every  path  from  a  block 
outside  /  to  a  block  within  I.  In  other  words,  /  is  a  single  entry  region. 

•  /  is  connected 

•  /-  {A}  is  cycle-free;  i.e.,  all  cycles  within  I  must  contain  h. 

The  order  in  which  nodes  are  added  to  an  interval  I  is  called  interval  order.  The  interval  order  is 
significant,  in  that  if  nodes  of  I  are  processed  in  interval  order,  a  particular  node  will  be  treated  only  after 
all  its  predecessors  have  been  processed. 

Intervals  correspond  to  loops  in  the  program.  The  intervals  of  the  original  flow  graph  represent  the 
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innermost  loops.  The  graph  derived  from  a  flow  graph  G  by  replacing  the  set  of  nodes  that  form  an  interval 
by  one  node  representing  the  interval  is  called  a  derived  flow  graph  of  G.  dore  formally: 

Definition  2:  (Derived flow  graph)  Given  a  flow  graph  G  with  initial  node  n0,  the  derived  flow 
graph  1(G)  is  the  following  graph: 

1.  The  nodes  of  1(G)  are  the  intervals  found  by  interval  partition  of  G. 

2.  If  /  and  K  are  two  intervals,  there  is  an  edge  from  J  to  K  in  1(G)  if  and  only  if  there  exist 
nodes  ny  e  J  and  nkE  K  such  that  nk  is  a  successor  of  rij  in  G. 

3.  The  initial  node  of  1(G)  is  the  interval  including  n0. 

Given  a  node  h  in  a  flow  graph  G,  the  interval  /  for  h  is  built  by  adding  all  basic  blocks  of  G  not  in  I 
whose  predecessors  are  already  in  the  interval  /;  the  detailed  algorithm  is  given  in  Figure  3-1. 


Input:  The  specified  head  h. 

Output:  max_intenal[h] 
begin 

I  :=  (A); 

while  3ie  ( S[I]-f)  such  that  P[z]  c  I 
do 

/  :=  <A); 
od; 

maxjJUerval(h)  :  -  I; 
end 


Figure  3-1:  Algorithm  to  construct  the  maximum  interval  for  a  given  head 


Figure  3-2  shows  a  flow  graph  -  the  shaded  areas  depict  the  initial  partition  into  interval';.  Note  how  each 
individual  loop  is  captured  in  one  interval. 

Interval  partition  is  repeated  until  the  derived  graph  consists  of  only  one  single  node,  called  the  limit  flow 
graph.  We  call  the  sequence  of  graphs  derived  by  repeated  interval  partition  that  starts  with  a  flow  graph  G 
the  derived  sequence  of  G. 

Irreducible  flow  graphs 

Flow  graphs,  for  which  a  limit  flow  graph  that  consists  of  only  one  node  does  not  exist,  are  called 
irreducible.  Flow  graphs,  whose  limit  flow  graph  consists  of  only  one  single  node,  are  called  reducible. 
Flow  graphs  of  we11  structured  programs  are  always  reducible,  and  there  are  techniques  for  changing 
irreducible  flow  grapas  into  reducible  flow  graphs  [Aho  84],  For  the  purpose  of  this  thesis  we  assume  well 
structured  programs,  hence  we  do  not  address  the  irreducibility  of  flow  graphs  in  the  remainder  of  this 
chapter. 

Definition  3:  (Derived  sequence  of  a  flow  graph):  Given  a  flow  graph  G,  the  sequence  of 
graphs  (Gq,G  is  called  the  derived  sequence  for  G  if 

*G  =  G0 

•  Gj  is  derived  by  interval  partition  from  Gj,}  V;e  { l,...,m} 

•  Gm  is  the  limit  flow  graph  for  G0 
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Figure  3-2:  A  flow  graph  and  its  initial  interval  partition 
first  derived  graph 


Figure  3-3:  Sequence  of  derived  graphs 


Figure  3-3  shows  the  derived  seq,-'nce  for  the  flow  graph  depicted  in  Figure  3-2.  The  shaded  regions  of 
the  flow  graph  graph  shown  in  Figure  3-2  depict  the  intervals  after  the  first  interval  partition,  which  are  the 
nodes  of  the  first  derived  flow  graph.  The  second  generation  of  interval  partition  yields  the  limit  flow 
graph. 

Interval  partition  gives  rise  to  a  two-pass  algorithm  for  data  flow  analysis.  In  the  following,  the  method  is 
discussed  as  it  applies  to  an  availability  system 
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32.1.  Availability  analysis 

An  expression  e  is  said  to  be  available  at  a  basic  block  b  in  the  program  if  it  has  been  computed  before 
control  reaches  b,  and  none  of  its  operands  have  been  destroyed  between  its  computation  and  b.  Local 
availability  means  availability  within  one  basic  block,  global  availability  means  availability  within  the 
entire  flow  graph. 


An  expression  e  is  redundant  at  point  p  if  e  is  available  on  all  paths  that  lead  to  p.  Availability  is  one  of 
the  key  problems  in  program  optimization  because  it  allows  to  identify  redundant  expressions  which  may 
be  eliminated  from  the  program. 

We  now  introduce  an  availability  system  that  was  developed  by  Morel  and  Renvoise  [MorelRenvoise  81]. 
We  first  explain  the  local  properties  of  basic  blocks  which  are  needed  to  solve  the  availability  system. 
Up-transparency  UTRANSP 

A  block  is  said  to  be  up-transparent  for  an  expression  if  the  block  does  not  contain  any 
modification  of  the  operands  of  the  expression,  or  if  the  first  modification  of  an 
operand  of  the  expression  occurring  in  the  block  is  preceded  by  a  computation  of  the 
expression. 

Down-transparency  DTRANSP 

A  block  is  said  to  be  down-transparent  for  an  expression  if  the  block  does  not  contain 
any  modification  of  the  operands  of  the  expression,  or  if  the  last  modification  of  an 
operand  of  the  expression  occurring  in  the  block  is  followed  by  a  computation  of  the 
expression. 

Local  availability  COMP 

An  expression  e  is  said  to  be  locally  available  in  a  block  i  if  there  is  at  least  one 
computation  of  the  expression  in  the  block  i,  and  if  the  instructions  appearing  in  the 
block  after  the  last  computation  of  the  expression  do  not  modify  e's  operands. 

We  will  use  AVINi  to  denote  global  availability  upon  entry  of  block  i  and  AVOVTi  to  denote  global 
availability  upon  exit  from  block  i.  Further,  boolean  conjunctions  are  denoted  •  and  ,  disjunctions  +  and 
X  respectively.  In  the  following,  let  B  denote  the  set  of  all  basic  blocks. 

In  this  notation,  an  expression  is  available  on  entry  to  a  block  if  it  is  available  on  exit  from  each 
predecessor  of  the  block.  An  expression  is  available  on  exit  from  a  block  if  it  is  locally  available,  or  if  it  is 
available  on  entry  to  the  block  and  down-transparent  in  this  block.  This  leads  to  the  following  set  of 
equations  for  global  availability: 

AVOUT=COMP  ^DTRANSP* AVINi 
WN/Tlj'PrvmMOinyeB 


32.2.  Pass  1  of  interval  analysis 

During  the  first  pass,  local  quantities  COMP ,  DTRANSP  and  UTRANSP  are  computed  for  larger  and 
larger  regions  of  the  program.  The  algorithm  II  depicted  in  Figure  3-4  computes  COMP,  DTRANSP  and 
UTRANSP  for  an  interval  from  their  values  for  blocks  in  the  interval  If  G0,  G},  ...  ,  Gm  is  the  derived 
sequence  of  flow  graph  G  =  G0,  Pass  1  consists  of  applying  algorithm  II  to  each  interval  in  G0,  then  to  each 
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Input: 

Output: 

Miscellaneous: 

Method: 


an  interval  /  with  head  h 

COMP ,  DTRANSP x  and  UTRANSP x  for  each  xe  I 


COMP r  DTRANSP,  and  UTRANSP, 
Succ(x)  are  the  successors  of  x 


begin 

COMPj=COMPk; 

DTRANSP f=DTRANSPk ; 

UTRANSP j=UTRANSPk; 

for  all  xe  I-[h)  do 
for  all  ye  5uec(x)  do 

COMP^(COMP,eUTRANSPj^rCOMP) 

od; 

od; 

for  all  ze  /-  (A)  do 
DTRANSP; = 

(DTRANSP ,+UTRANSP ^DTRANSP t 
UTRANSP  j= 

(UTRANSP fDTRANSP ^UTRANSP x 

od; 

end 


Figure 3-4:  Algorithm// 


interval  in  G,,  and  so  on.  When  Gm.;  is  reached,  COMP ,  DTRANSP  and  UTRANSP  will  have  been 
computed  for  each  node  in  the  derived  sequence  of  graphs. 

3 23.  Pass  2  of  interval  analysis 


Input: 

Output: 

Miscellaneous- 

Method: 

begin 


an  interval  l  with  head  h 
AVJN  and  AVOUT  for  all  members  of  / 
Pred(x)  are  predecessors  of  x 


for  all  xe  /  in  interval  order,  starting  with  the  head  h  do 


Pred(x)AVCmr 

AVOUT  =C0MP+DTRANSP*AV7N. 


od; 


;nd 


Figure  3-5:  Algorithm  12 


During  the  second  pass,  AVIN  and  AVOUT  are  computed  for  smaller  and  smaller  regions  of  the  program. 
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If  x *  denotes  the  single  node  in  Gm,  Pass  2  begins  with  the  assignment  AVINX •  =  TRUE.  The  remainder  of 
the  pass  consists  of  repeated  application  of  algorithm  12  shown  in  Figure  3-5,  which  computes  AV1N  and 
AVOUT  for  each  node  in  an  interval  /,  given  correct  AMN  and  AVOUT  sets  for  the  entry  to  /  and  to  each 
successor  J  of  I.  12  is  applied  first  to  the  interval  of  Gm,  then  to  the  intervals  of  Gm  l  and  so  on  until  AVIN 
and  AVOUT  have  been  computed  for  every  node  in  the  original  graph  G. 

33.  Parallel  interval  analysis 

The  key  idea  in  our  parallelization  of  interval  analysis  is  that  disjoint  intervals  contain  disjoint  sets  of 
basic  blocks  and  therefore  may  be  treated  independently.  Another  important  observation  is  that  an  interval 
Ij  in  the  i-th  derived  flow  graph  depends  only  on  a  subset  S  of  the  basic  blocks.  Thus,  not  only  the  nodes  of 
the  i-th  derived  graph  Gi  but  all  nodes  in  G0,  Gj, ....  G-  independent  of  S  may  be  processed  in  parallel  with 
interval  /■. 

Recall  that  during  Pass  1  algorithm  II  is  applied  first  to  the  nodes  of  the  original  flow  graph,  then  to  the 
nodes  of  the  first  derived  flow  graph  etc.  Thus,  the  level  of  parallelism  decreases  as  we  progress  through 
the  sequence  of  derived  flow  graphs.  In  Pass  2,  algorithm  12  is  first  applied  to  the  limit  flow  graph,  then  to 
the  previous  derived  flow  graph  etc.  Therefore  the  level  of  parallelism  ir.  Pass  2  increases  as  we  progress 
through  the  (reversed)  sequence  of  derived  flow  graphs. 

The  order  in  which  the  nodes  of  the  derived  sequence  are  processed  corresponds  to  a  postorder  traversal 
of  a  tree  composed  of  the  nodes  in  the  graphs  of  the  derived  sequence,  called  complete  interval  tree. 

Definition  4:  (Complete  interval  tree )  Given  a  program  flow  graph  G  and  its  derived  sequence 
{G0,G j,...,Gm] ,  the  complete  interval  tree  of  G  is  a  tree  with  the  following  properties: 

•  The  leaves  of  the  tree  are  basic  blocks. 

•  Every  node  in  the  tree  corresponds  to  an  interval  in  a  graph  in  the  derived  sequence 
{G0,G1,...,GJ. 

•  There  is  an  edge  from  node  J  to  node  K  iff  the  interval  J  of  derived  flow  graph  Gm 
contains  the  interval  K  of  the  previous  derived  flow  graph  Gm.;. 

•  The  root  of  the  tree  is  Gm,  the  limit  flow  graph. 


Eliminating  redundant  nodes  from  the  tree 

Figure  3-7  shows  a  complete  interval  tree  in  which  each  node  n  has  an  associated  weight  that  equals  the 
number  of  leaf  nodes  in  the  subtree  rooted  by  n.  The  weight  of  a  node  corresponding  to  an  interval  is 
therefore  the  number  of  basic  blocks  in  that  interval.  The  intervals  of  the  i+7st  derived  flow  graph  are 
composed  of  the  nodes  of  the  ith  derived  flow  graph. 

The  sequence  of  the  derived  flow  graphs  reflects  the  loop  structure  of  a  program:  the  first  derived  flow 
graph  reflects  the  innermost  loops,  the  second  level  loops  that  enclose  the  innermost  loops  etc.  So  the 
weight  of  intervals  containing  the  basic  blocks  that  form  loops  increases  with  the  index  of  the  derived  flow 
graph. 
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Looking  at  our  example  in  Figure  3-7,  the  associated  weight  of  nodes  in  the  leftmost  branch  of  the  tree 
does  not  change  until  the  root  Thus,  in  each  iteration  of  building  the  derived  flow  graphs  the  intervals 
represented  by  the  leftmost  nodes  in  the  tree  contain  the  same  one  basic  block.  Therefore,  the  data  flow 
information  for  these  nodes  does  not  change  from  one  iteration  to  the  next  This  suggests  the  following 
elimination  rule:  Nodes  of  the  i+1  si  derived  flow  graph  that  have  only  one  child  (i.e.  whose  weight  is  no 
larger  than  the  child’s  weight)  need  not  be  reconsidered  during  the  flow  analysis  pass  and  can  be  eliminated 
from  the  complete  interval  tree.  Figure  3-8  depicts  the  same  complete  interval  tree  after  all  nodes  that  do 
not  change  the  data  flow  analysis  information  have  been  eliminated.  Eliminating  redundant  nodes  from  the 
complete  interval  tree  means  to  avoid  unnecessary  overhead  of  scheduling  redundant  nodes  during 
processing. 

The  complete  interval  tree  is  th<  basis  of  our  parallelizaFon  of  interval  analysis.  Algorithm  II  is  applied 
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Figure  3-8:  Interval  tree  after  elimination 

to  the  complete  interval  tree  bottom  up,  starting  at  the  leaves  of  the  tree,  computing  the  COMP ,  DTRANSP 
and  UTRANSP  sets  of  the  program  for  bigger  and  bigger  regions  of  the  flow  graph.  Upon  completion  of 
algorithm  11,12  computes  the  AVIN  and  the  AVOUT  sets  of  variables  for  smaller  and  smaller  regions  of  the 
program,  starting  at  the  root  of  the  complete  interval  tree  (which  denotes  the  entire  program)  and  ending  at 
the  leaves  of  the  tree. 

The  order  in  which  algorithms  II  and  12  have  to  be  applied  to  the  nodes  of  the  complete  interval  tree  is 
only  restricted  by  the  parent/child  relations  of  the  nodes.  In  Pass  1,  a  node  J  can  only  be  processed  after  all 
its  successors  have  been  processed.  In  Pass  2,  a  node  J  can  only  be  processed  after  all  its  parent  nodes  have 
been  processed.  Therefore,  all  nodes  that  are  not  ancestors  or  descendants  of  each  other  can  be  processed 
independently.  The  execution  of  nodes  that  are  in  ancestor  relationship  must  be  synchronized. 

3.4.  A  model  to  approximate  the  amount  of  parallelism  in  parallel  interval  analysis 

Assuming  that  the  processing  times  for  each  individual  node  are  known,  the  minimal  processing  time  for 
a  given  complete  interval  tree  is  bounded  by  the  longest  path  from  the  root  to  a  leaf  node,  where  the  length 
of  a  path  is  the  sum  of  the  processing  times  of  its  nodes,  more  formally: 

Definition  5:  (Length  of  a  path  in  a  complete  interval  tree)  Given  a  complete  interval  tree  T 
with  root  R,  nodes  N  and  processing  time  r(n)  for  all  nodes  n  e  N,  let  p  be  a  path  from  a  leaf  node 
to  the  root  R  consisting  of  N*  qN.  Then,  the  length  of  p  Lip)  is  defined  as 

MP)=  X  *(n) 

netr 

The  formal  definition  of  the  minimal  processing  time  is  then: 

Definition  6:  (Minimal  processing  time  for  a  complete  interval  tree)  Given  a  complete  interval 
tree  T  with  root  R,  let  P  be  the  set  of  all  paths  from  die  leaf  nodes  of  T  to  the  root  R.  The  minimal 
processing  time  for  T,  t^JT)  is  the  length  of  the  longest  path  among  P: 

fmJX)=max{Up)]peP) 

The  most  efficient  parallelization  processes  the  tree  in  minimal  time  and  utilizes  all  processing  elements 
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100%  of  the  time.  A  tree  structure  as  basis  for  parallelism  does  not  permit  100%  processor  utilization  if  the 
number  of  processing  elements  exceeds  1.  The  reason  is  that  the  root  node  can  not  be  processed  in  parallel 
with  any  other  node,  so  all  processors  except  one  are  idle  when  the  root  node  is  ready  for  processing. 

The  number  of  processors  that  can  be  used  efficiently  for  processing  a  complete  interval  tree  in  minimal 
time  depends  on  the  shape  of  the  complete  interval  tree.  Figure  3-9  shows  complete  interval  trees  of 
different  shapes. 


Figure  3-9:  Differently  shaped  complete  interval  trees 


The  three  examples  in  Figure  3-9  represent  the  spectrum  of  shapes  that  complete  interval  trees  can  have. 
The  tree  with  the  root  labeled  w  is  wide  and  flat  Recall  that  the  leaf  nodes  of  complete  interval  trees  are 
the  basic  blocks  of  the  program.  A  program  that  results  in  a  complete  interval  tree  of  that  form  contains  no 
loops.  A  program  that  consists  of  one  deep  nest  of  loops  results  in  a  complete  interval  tree  that  looks  like 
the  example  labeled  l  in  Figure  3-9:  the  innermost  loop  is  found  at  the  bottom  of  the  tree,  basic  blocks  that 
are  parts  of  the  enclosing  loops  are  accumulated  on  the  way  from  the  bottom  to  the  root  The  tree  labeled  p 
is  a  perfectly  balanced  binary  tree,  resulting  from  several  independent  loop  nests  in  the  program. 

Intuitively  trees  that  are  wide  and  bushy  can  keep  a  large  number  of  processors  busy;  trees  that  are  long 
and  thin  do  not  permit  much  parallelism.  It  is  easy  to  determine  the  minimal  processing  time  for  a  given 
complete  interval  tree  when  the  processing  time  for  each  node  is  known.  Given  an  unlimited  resource  of 
processors,  every  complete  interval  tree  can  be  processed  in  minimal  time.  Given  only  a  fixed  number  of 
processors,  it  is  hard  to  determine  a  priori  whether  a  given  tree  can  be  processed  in  minimal  time.  The 
following  paragraphs  address  the  problem  of  establishing  a  lower  bound  for  the  number  of  processors 
needed  to  process  a  complete  interval  tree  in  minimal  time.  A  formal  definition  of  this  bound  is  given 
below: 

Definition  7:  (Parallel  efficiency  of  a  complete  interval  tree)  Given  a  complete  interval  tree  T 
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with  minimal  processing  time  t mi>1,  the  parallel  efficiency  of  T  is  the  minimal  number  of 
processors  needed  to  process  T  in  t^. 

The  parallel  efficiency  of  a  complete  interval  tree  T  is  a  measure  for  the  "amount"  of  parallelism  in  T.  If 
the  parallel  efficiency  for  a  complete  interval  tree  T  is  known,  the  speedup  observed  in  a  parallel 
implementation  can  be  analyzed  appropriately.  If  for  example  the  parallel  efficiency  of  a  complete  interval 
tree  is  p,  and  p'>p  processors  operate  in  parallel,  there  exists  a  schedule  for  executing  the  nodes  in  the 
complete  interval  tree  with  p  processors  such  that  the  speedup  observed  with  p’  processors  does  not  exceed 
the  speedup  observed  with  p  processors.  The  reason  is  that  only  p  processors  can  be  used  efficiently,  any 
additional  processor  is  bound  to  be  idle  a  lot  of  the  time. 

We  start  the  discussion  by  showing  some  properties  of  complete  interval  trees  that  allow  us  to  establish 
an  upper  bound  on  the  parallel  efficiency  of  a  complete  interval  tree. 


3.4.1.  Properties  of  complete  interval  trees 

Basic  blocks  and  nodes  in  the  complete  interval  tree  are  the  smallest  units  processed  in  parallel.  Before 
we  reason  about  the  parallelism  in  the  tree,  we  state  some  properties  of  complete  interval  trees  for  reducible 
flow  graphs.  These  properties  permit  to  compute  certain  bounds  for  the  width  and  the  height  of  such  trees. 

Lemma  8:  Given  a  complete  interval  tree  T  in  which  all  redundant  nodes  have  been  eliminated 
by  the  elimination  rule,  every  node  in  T  that  is  not  a  leaf  node  has  at  least  2  children. 

Proof:  Each  node  in  the  complete  interval  tree  corresponds  to  either  a  basic  block  in  the  flow 
graph  of  a  program  or  to  an  interval  in  die  flow  graph.  Basic  blocks  form  leaf  nodes  in  the  tree 
and  have  no  children.  By  the  elimination  rule,  nodes  that  correspond  to  intervals  that  consist  of 
only  one  node  do  not  occur  in  T.  Therefore,  only  intervals  that  consist  of  at  least  two  nodes  are 
represented.  By  construction  of  complete  interval  trees,  such  nodes  must  have  at  least  two 
children. 

Lemma  9:  Given  a  complete  interval  tree  T  of  a  program  flow  graph  G  with  n  basic  blocks,  the 
height  of  T  is  at  most  n . 

Proof:  Induction  on  n,  the  number  of  basic  blocks.  n=l:  A  flow  graph  consisting  of  one  basic 
block  is  equivalent  to  the  corresponding  limit  flow  graph,  therefore  the  complete  interval  tree 
consists  only  of  one  node  and  has  height  1.  n->n+l :  Under  the  assumption  that  T s  height  is  at 
most  n  let  G'  be  the  flow  graph  derived  from  G  by  adding  a  basic  block  ft  at  an  arbitrary  location. 

The  corresponding  interval  tree  T  can  differ  from  T  as  follows: 

1.  ft  is  part  of  an  inner  loop,  in  which  case  the  interval  in  G  that  corresponds  to  this  loop 
will  consist  of  one  more  basic  block  -  the  height  of  T  equals  the  height  of  T 

2.  ft  starts  a  new  interval  and  therefore  T  can  have  at  most  one  more  node  -  therefore  the 
height  of  T  can  be  at  most  the  height  of  7+1  =  n+1 

Lemma  10:  Given  a  complete  interval  tree  T  of  a  program  flow  graph  G  with  n  basic  blocks, 
the  number  of  nodes  in  T  is  at  most  2n- 1. 

Proof:  T  is  composed  of  basic  blocks  and  nodes  of  the  derived  flow  graphs.  Each  node  in  a 
derived  flow  graph  has  at  least  two  children,  otherwise  it  would  be  eliminated  by  the  elimination 
rule.  In  the  worst  case  there  can  be  at  most  one  node  in  any  derived  flow  graph  per  basic  block  - 
again  because  of  the  elimination  rule.  Further,  there  must  be  at  least  one  interval  that  contains 
two  basic  blocks.  If  the  number  of  basic  blocks  is  n,  the  total  number  of  interval  nodes  is 
bounded  by  n-1 ,  hence  the  total  number  of  nodes  in  T  is  bounded  by  2n-l. 


During  the  processing  of  a  complete  interval  tree  T,  only  a  subset  of  nodes  are  executable  at  any  given 
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dme.  In  the  beginning,  only  leaf  nodes  are  executable;  an  inner  node  n  in  the  complete  interval  tree 
becomes  executable  once  all  the  nodes  in  the  subtree  rooted  at  n  have  been  processed.  Before  we  compute 
an  upper  bound  on  the  number  of  nodes  in  a  complete  interval  tree  that  can  be  executable  simultaneously, 
we  give  a  formal  definition  for  that  quantity. 

Definition  11:  (PARSET  of  a  tree )  Given  a  complete  interval  tree  T  with  minimal  processing 
time  /„  let  t  <  r  • .  PARSET(t)  is  the  set  of  nodes  in  T  that  are  executable  at  time  t. 

mui  nun 

Note  that  the  size  of  the  PARSET  of  a  complete  interval  tree  is  independent  of  the  number  of  processors. 
In  the  next  lemma,  we  state  that  the  PARSET  of  a  complete  interval  tree  can  not  exceed  a  certain  number. 

Lemma  12:  Given  a  complete  interval  tree  T  of  a  program  flow  graph  with  n  basic  blocks  with 
minimal  processing  time  t^,  PARSET(T)<n. 

Proof:  Case  1 :  PARSET  consists  exclusively  of  basic  blocks.  In  that  case  \PARSET(t)\  is 
certainly  at  most  n. 

Case  2:  PARSET  consists  of  both  interval  nodes  and  basic  blocks.  Let  K  be  an  arbitrary  node  in 
PARSET(t).  Key  observation  in  the  proof  is  that  while  K  is  processed,  PARSET (t)  can  not  contain 
any  node  in  K’s  subtree  or  any  node  on  the  path  from  K  to  the  root  Let  PARSET(t)  contain  kt 
basic  blocks,  k 2  nodes  of  the  first  derived  flow  graph, ....  kt  nodes  of  the  last  derived  flow  graph. 

By  Lemmas  9  and  10,  the  number  of  nodes  in  the  subtree  of  a  node  of  thej-th  derived  flow  graph 
is  at  least  2 i  and  the  number  of  nodes  along  the  path  from  that  node  is  n-j.  Thus  the  number  of 
nodes  in  PARSET(t)  contains  at  most  (2n  - 1)  -  (*,x  2 i+n-j)<n  q.e.d. 

As  a  result  of  Lemma  12,  our  parallelization  can  keep  at  most  n  processors  busy  simultaneously  if  n  is  the 
number  of  leaf  nodes  in  the  complete  interval  tree.  There  are  cases  in  which  the  complete  interval  tree  can 
be  processed  in  minimal  time  with  fewer  than  n  processors,  where  n  is  the  number  of  leaf  nodes  of  the 
complete  interval  tree.  This  is  illustrated  in  Figure  3-10.  Under  the  simplifying  assumption  that  every  node 


Figure  3-10:  Minimal  time  with  fewer  processors  than  leaf  nodes 
requires  the  same  processing  time,  the  longest  path  from  the  root  to  a  leaf  node  consists  of  nodes  { 1,3, 5, 7} 
and  therefore  the  minimal  processing  time  for  this  tree  is  4.  The  number  of  leaf  nodes  is  4,  but  the  tree  can 
be  processed  in  minimal  time  with  just  2  processors;  this  is  achieved  by  processing  nodes  {6,7},  then  {4,5}, 
then  {2,3}  and  finally  the  root,  { 1 } .  The  parallel  efficiency  of  that  tree  is  2;  we  will  show  in  the  following 
sections  that  computing  the  parallel  efficiency  for  a  given  interval  tree  is  NP  complete;  we  conclude  our 
theoretical  analysis  by  computing  upper  and  lower  bounds  for  the  parallel  efficiency  of  complete  interval 


trees. 
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3.4.2.  Bounds  on  the  parallel  efficiency  of  complete  interval  trees 

To  be  able  to  compute  the  parallel  efficiency  of  a  given  interval  tree,  we  must  know  the  processing  times 
for  all  nodes.  Usually,  the  processing  times  for  different  nodes  in  a  realistic  complete  interval  tree  vary 
considerably.  In  the  following,  we  assume  that  for  every  node  in  the  complete  interval  tree  the  computation 
time  is  known. 

The  next  lemma  states  that  determining  the  parallel  efficiency  of  an  arbitrary  complete  interval  tree  is  NP 
complete.  We  will  show  this  by  reducing  optimal  binpacking  [Garey,  M.R.  and  Johnson,  D.S.  79]  to 
computing  the  parallel  efficiency. 

Lemma  13:  Given  an  arbitrary  complete  interval  tree  T,  it  is  NP  complete  to  determine  the 
parallel  efficiency  of  T. 

Proof:  Let  be  an  arbitrary  sequence  of  packets.  The  bin  packing  problem  consists 

of  finding  the  minimal  number  of  bins  with  capacity  c  that  hold  {w'1,...,wfl} .  Given  that  sequence 
of  packets,  we  construct  a  complete  interval  tree  as  follows:  the  leaves  of  tree  are  basic  blocks 
such  that  the  execution  time  for  b—w^ie  (l,...,n)  and  the  execution  time  for 
b„+1=c,  such  that  c>wi,Vie  (l,.../ij.  The  root  of  the  complete  interval  tree  consists  of  a  node 
with  an  arbitrary  execution  time,  warl>.  The  minimal  execution  time  for  that  tree  is  then  c+warb, 
and  the  parallel  efficiency  is  equal  to  1  +  (the  minimal  number  of  bins  of  capacity  c  needed  to 
pack  {h'1,...,h'J).  Hence,  finding  the  parallel  efficiency  of  arbitrary  complete  interval  trees  is  NP 
complete. 

If  the  execution  time  for  every  node  in  a  complete  interval  tree  is  known,  the  length  of  the  longest  path 
from  the  root  to  a  leaf  node  in  the  tree  is  also  known.  To  process  the  complete  interval  tree  in  minimal 
time,  there  is  a  deadline  for  every  node  at  which  that  node  must  be  scheduled.  Given  the  execution  time 
and  the  deadline  for  every  node,  each  node  can  be  mapped  to  a  2  b’ple  suc^  that  li  is  the  node’s 
deadline  and  r2=f  j  +  the  node’s  execution  time. 

There  are  numerous  ways  to  schedule  the  nodes  in  a  complete  interval  tree  such  that  the  tree  is  processed 
in  minimal  time.  A  schedule  consists  of  a  mapping  of  nodes  of  the  complete  interval  tree  to  a  tuple  that 
consists  of  processor  and  time,  more  formally: 

Definition  14:  ( Schedule )  Given  a  set  of  processors  P,  a  complete  interval  tree  with  nodes  N, 
and  minimal  execution  time  t^,  the  function  S  with 

S:Nx  [0^]  ->  (P  x  [(VmJ } 

S(nj)=(p,t„) 

such  that  n  is  executable  at  time  t  and  tn  S  d(n),  where  d(n)  is  the  deadline  for  node  n  is  called  a 
schedule  for  T. 

The  simplest  schedule  that  allows  to  process  a  complete  interval  tree  in  minimal  time  is  obtained  by 
scheduling  every  node  as  soon  as  it  is  executable.  Thus,  all  basic  blocks  are  scheduled  at  the  beginning, 
each  on  a  distinct  processor.  This  leads  to  our  first  upper  bound  for  the  parallel  efficiency  of  a  complete 
interval  tree,  introduced  in  the  next  lemma. 

Lemma  15:  Given  a  complete  interval  tree  T  with  b  leaf  nodes,  the  parallel  efficiency  of  T  is  at 
mostb. 

Proof:  By  Lemma  12,  PARSET(t)<bVt.  Therefore,  given  b  processors,  there  is  always  a 
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distinct  processor  for  each  executable  node  in  T,  and  T  can  be  processed  in  minimal  time. 
Therefore,  b  is  an  upper  bound  on  the  parallel  efficiency  of  T. 

It  is  easy  to  see  that  only  for  a  complete  interval  tree  T  that  is  wide  and  flat,  the  number  of  leaf  nodes  is  a 
reasonable  approximation  of  Ts  parallel  efficiency.  For  some  trees,  fewer  processors  are  needed  for 
minimal  processing  time  when  every  node  is  scheduled  at  its  deadline,  rather  than  when  it  is  executable.  In 
such  a  schedule,  the  number  of  processors  needed  is  equal  to  the  number  of  nodes  in  the  tree  whose 
execution  interval  overlaps. 

Definition  16:  (Execution  interval  of  a  node)  Given  a  node  n  in  a  complete  interval  tree  with 
deadline  td  and  execution  time  te,  we  say  that  we  say  that  is  the  execution  interval  of  n. 

Lemma  17:  Given  a  complete  interval  tree  T  with  nodes  A\  the  parallel  efficiency  of  T  is 
bounded  by  the  largest  subset  S<zN  such  that  the  execution  intervals  of  the  nodes  in  S  overlap. 

Proof:  Suppose  that  the  size  of  the  largest  subset  S  c  N  is  p,  but  the  parallel  efficiency  of  T  is 
p’>p.  If  every  node  in  T  is  scheduled  at  its  deadline,  certainly  T  is  processed  in  minimal  time. 
Hence,  there  is  a  schedule  using  p  processors  in  which  T  can  be  processed  in  minimal  time. 
Hence,  the  parallel  efficiency  of  T  is  at  most  p  -  a  contradiction. 

We  have  seen  that  it  is  hard  to  predict  how  many  parallel  processors  can  be  used  efficiently,  even  if 
implementation  specific  parameters  are  ignored.  In  a  parallel  implementation,  system  parameters  can  not 
be  ignored,  and  even  if  a  complete  interval  tree  has  a  large  parallel  efficiency  in  reality  only  a  number  of 
processors  that  is  less  than  the  parallel  efficiency  can  be  used  effectively.  We  have  implemented  parallel 
interval  analysis  to  assess  how  many  processors  can  be  used  in  parallel  in  practice. 

3.5.  Implementation:  a  test  case 

The  goals  of  our  implementation  of  parallel  interval  analysis  were  twofold.  First,  we  wanted  to  assess  the 
suitability  of  a  parallelization  based  on  a  program’s  explicit  loop  structure.  From  the  previous  section  it  is 
clear  that  optimal  parallel  speedup  for  parallel  interval  analysis  is  hard  to  predict,  even  when  system 
parameters  are  ignored.  Finding  a  perfect  schedule  to  process  a  given  complete  interval  tree  in  parallel 
requires  expensive  analysis.  Finding  such  a  good  schedule  does  not  pay  off  in  practice  if  the  theoretically 
optimal  speedup  is  cancelled  by  implementation  and  system  parameters.  Further,  it  is  hard  to  predict  the 
processing  time  of  individual  nodes  in  a  complete  interval  tree  accurately.  Therefore,  the  second  goal  of 
our  implementation  was  to  assess  whether  a  simple  approach  to  scheduling  the  nodes  of  a  complete  interval 
tree  suffices  in  practice. 

3.5.1.  Implementation  details 

Our  program  for  parallel  interval  analysis  runs  as  C  process  under  the  Mach  operating  system  on  an 
Encore  Multimax  system  with  14  parallel  processors  [Multimax  88]. 

Even  though  our  algorithm  could  be  easily  integrated  into  a  production  compiler,  we  wanted  to  study  the 
effects  of  parallel  interval  analysis  in  isolation.  We  chose  to  build  an  interface  to  a  production  compiler  to 
be  able  to  obtain  the  flow  graphs  of  real  user  programs  without  having  to  deal  with  the  compiler’s  front  end 
and  back  end. 
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The  GNU  C  compiler  developed  at  the  Free  Software  Foundation  [Stallman  88]  served  as  "host 
compiler".  A  swatch  in  the  GNU  C  compiler  causes  the  RTL  representation  of  an  input  program  to  be 
written  to  a  file.  Reading  the  RTL  representation  from  that  file  and  re-constructing  the  program’s  flow 
graph  makes  our  implementation  completely  modular  the  basic  blocks  in  the  flow  graph  can  be  globally 
optimized,  and  the  optimized  RTL  representation  can  be  the  input  to  the  back  end  of  the  GNU  C  compiler 
in  a  separate  phase.  We  therefore  can  focus  on  the  parallelization  of  interval  analysis  but  are  able  to 
combine  our  parallel  implementation  of  interval  analysis  with  (possibly  parallelized)  front  ends  and  back 
ends  of  compilers  whose  intermediate  representation  is  RTL.  The  possibility  to  run  parallel  interval 
analysis  on  any  C  program  allows  to  assess  the  parallel  performance  for  a  realistic  set  of  test  cases. 
Parallelism  is  expressed  by  means  of  the  cthreads  library  [Cooper  88]. 

3 S 2.  The  scheduling  algorithm 

The  central  scheduling  construct  of  the  parallel  optimizer  is  a  ready  queue :  each  node  of  the  complete 
interval  tree  that  is  ready  to  be  processed  is  enqueued  in  the  ready  queue  in  arbitrary  order.  A  two  level 
hierarchy  of  threads  is  active  during  Stage3: 

Master  thread  The  master  thread  has  two  functions:  first,  it  sets  up  the  ready  queue  (initially  all  leaves 
of  the  complete  interval  tree  are  scheduled)  and  forks  a  fixed  number  of  server  threads. 
Second,  it  is  responsible  for  load  balancing:  the  master  thread  enqueues  the  next  node 
of  the  ready  queue  in  the  data  queue  of  the  server  thread  with  the  currently  lowest  work 
load. 

Server  threads  The  server  threads,  each  with  its  private  data  queue ,  execute  concurrently.  A  server 
processes  the  nodes  enqueued  in  its  data  queue  and  enqueues  new  nodes  into  the  ready 
queue  until  it  receives  a  termination  signal  from  the  master  thread.  Each  node  contains 
status  information.  Depending  on  that  status  information,  the  server  applies  either 
algorithm  11  (Pass  1)  or  algorithm  12  (Pass  2)  to  the  node.  During  Pass  1,  the  server 
notifies  the  parent  of  the  currently  processed  node  about  the  completion  of  the  child.  If 
all  children  have  finished  the  server  enqueues  the  parent  in  the  ready  queue.  During 
Pass  2,  the  server  enqueues  all  children  of  the  current  node  into  the  ready  queue. 

Each  thread  (the  master  thread  and  all  server  threads)  runs  on  a  separate  processor.  It  should  be  noted 
that  the  master  thread  is  merely  a  setup  and  load  balancing  agent  and  does  not  process  any  nodes  of  the 
complete  interval  tree.  Therefore,  the  best  possible  speedup  is  n-1  for  n  threads. 

3.6.  Results 

Recall  that  our  goal  was  to  investigate  how  useful  parallelism  based  on  explicit  program  structure  is  in 
practice,  so  we  measured  the  parallel  speedup  for  the  solution  of  only  one  data  flow  equation.  Multiple 
pass  optimizing  compilers  have  to  solve  numerous  data  flow  equations  while  the  shape  of  the  program 
remains  the  same.  Fo  a  parallel  implementation  this  means  that  a  substantial  part  of  the  parallel  overhead 
is  independent  of  the  number  of  equations  solved.  For  multiple  equations,  task  management, 
synchronization  and  scheduling  efforts  remain  the  same  while  the  (sequential)  work  performed  at  the  nodes 
of  the  complete  interval  tree  is  increased.  As  a  result,  better  speedup  can  be  anticipated  if  more  equations 
are  solved. 
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We  therefore  did  not  attempt  to  implement  a  tool  that  performs  exhaustive  data  flow  analysis  in  parallel 
but  restricted  our  implementation  to  the  availability  system  described  in  Section  2.  By  the  previous 
argument,  if  we  can  observe  good  speedup  for  that  one  equation,  our  approach  to  parallelizing  interval 
analysis  can  be  successfully  extended  to  global  optimization  problems  that  fit  into  the  same  framework. 

3.6.1.  The  benciuiuii  k  functions 

The  amount  of  parallelism  in  our  implementation  is  determined  by  the  shape  of  the  complete  interval  tree 
which  depends  mainly  on  the  size  of  the  flow  graph  and  the  loop  structure  of  the  input  program.  We 
therefore  used  benchmark  functions  that  differ  in  the  number  of  basic  blocks  and  contain  nested  loops,  a 
characteristic  that  is  met  by  many  scientific  programs.  We  used  a  set  of  five  C  benchmark  functions  from 
the  scientific  computing  domain  where  the  size  of  the  flow  graph  varied  between  30, 64, 176, 207  and  296 
nodes  (=  basic  blocks).  In  the  following  we  will  call  those  benchmark  functions  f30,  f^,  f176,  f2Q7  and  f296. 


3.6 2.  Experiment  description 

We  implemented  both  our  parallelization  and  a  standard  implementation  of  sequentrJ  interval  analysis 
based  on  the  algorithm  described  in  [Kennedy  81],  orthogonal  to  the  paraJW  aplementation.  It  therefore 
provides  a  basis  for  a  fair  comparison  with  our  implementation  of  parallel  inter,  al  analysis. 

For  all  functions,  we  treasured  both  the  parallel  and  the  sequential  execution  time.  For  the  parallel 
implementation,  we  varied  the  number  of  server  threads  (and  therefore  processors)  between  2,  4,  6,  8  and 
10.  Recall  that  the  master  thread  runs  on  a  separate  processor  but  is  not  involved  in  the  work  on  the 
complete  inte  .val  tree.  So  the  number  of  processors  varied  between  3,  5,  7,  9  and  1 1  but  the  best  possible 
speedup  varies  between  2,  4,  6,  8  and  10.  To  ensure  that  all  parallel  processors  are  dedicated  to  parallel 
interval  analysis,  we  used  the  allocate_processor  facility  provided  by  the  experimental  Mach  system 
running  on  our  Encore  Multimax.  This  facility  allows  to  allocate  a  fixed  number  of  processors  for  a  given 
amount  of  time  to  a  single  user,  possibly  delaying  the  user  until  the  system  load  allows  the  allocation.  No 
other  user  can  use  those  processors  during  that  time  interval.  This  facility  minimizes  the  interference  with 
other  processes  but  does  not  eliminate  it  completely. 

Each  test  was  run  multiple  times.  The  numbers  presented  in  this  paper  are  the  arithmetic  mean  of  those 
measurements.  The  deviations  of  the  individual  measurements  are  within  5%  of  the  average. 

3.6  J.  Measurements 

We  measured  the  speedup  of  parallel  interval  analysis  over  a  standard  sequential  implementation  of 
interval  analysis.  Before  the  servo-  threads  can  start  to  execute  in  parallel,  some  setup  work  has  to  be 
carried  out: 

1 .  Initialization  of  the  ready  queue  with  all  basic  blocks 

2.  Allocation  of  processors 
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3.  Forking  of  the  server  threads 

In  addition  tc  the  setup  time  and  the  time  actually  spent  processing  the  nodes  of  the  complete  interval  tree, 
the  parallel  execution  time  of  parallel  interval  analysis  accounts  for  task  scheduling,  task  synchronization 
and  load  balancing.  In  the  following,  the  total  parallel  execution  time  for  parallel  interval  analysis  is  called 
Toted  .  We  measured  the  parallel  setup  time  to  perform  steps  1.-3.  and  refer  to  it  as  0  .  Note  that  in 

pur 

our  implementation,  these  steps  are  carried  out  sequentially;  to  speedup  the  setup  time,  steps  1.,  2.  and  3. 
could  be  carried  out  in  parallel. 

Our  sequential  equation  solver  uses  the  standard  interval  analysis  algorithm  and  requires  some 
implementation  specific  setup  work  before  the  tree  nodes  can  be  processed.  In  the  following,  this  setup 
work  of  the  sequential  implementation  is  called  Oseq  and  the  total  sequential  time  to  perform  Phase3  is 
called  Totalseq. 

Since  we  implemented  interval  analysis  for  only  one  data  flow  equation,  the  amount  of  woik  performed  at 
each  node  of  the  complete  interval  tree  is  very  small  compared  to  a  realistic  optimizer.  Therefore  both  the 
sequential  setup  time  Oseq  ind  the  parallel  setup  time  0^  account  for  an  unrealistically  large  portion  of 
Total Sfq  and  Total ^  respectively.  To  get  a  more  accurate  picture  on  the  performance  of  oar 
parallelization,  we  looked  for  a  fair  method  to  factor  out  both  0  and  Ostq  from  the  computation  of  the 
parallel  speedup. 

On  first  sight  it  sounds  plausible  to  subtract  the  parallel  setup  time  0^  from  the  total  parallel  execution 
time  Total ^  and  0,eq  from  Totalseq  respectively  and  use  the  resulting  times  to  compute  the  parallel 
speedup.  However,  the  parallel  setup  time  is  in  general  larger  than  the  sequential  setup  time  since  it 
includes  inherently  expensive  parts  like  for  example  processor  allocation  in  a  time  shared  multi  user 
system.  We  therefore  decided  to  account  only  for  the  sequential  setup  time  by  subtracting  0  from  both 
Total __  and  Total ...  and  obtain  the  speedup  by  dividing  the  two  resulting  numbers  as  shown  in  the 
following  equation: 

speedup -_d- 

In  the  next  section  the  speedup  observed  always  refers  to  the  speedup  introduced  in  the  previous  equation. 

3.6.4.  Speedup  over  sequential  interval  analysis 

We  were  able  to  observe  parallel  speedup  for  every  benchmark  function.  The  speedup  varied  with  the 
size  of  the  benchmark  function.  Figure  3-11  depicts  the  speedup  for  f30  and  *6*'  Recall  that  the  best 
possible  speedup  is  at  most  p-1  if  we  use  p  processors  in  parallel,  since  one  of  the  processors  is  dedicated  to 
the  master  thread.  For  both  functions  the  speedup  increases  with  increasing  number  of  processors  and 
decreases  again  when  the  number  of  processors  exceeds  7  -  an  indication  that  scheduling  overhead  cancels 
the  parallel  speedup  for  small  functions.  In  general  the  observed  speedup  is  disappointing  even  though  it 
increases  for  fM.  The  situation  changes  drastically  for  f176  shown  in  Figt  re  3-12.  The  speedup  increases 
almost  linearly  when  the  number  of  processors  is  small.  We  observe  5-fold  speedup  using  9  processors. 
Adding  more  processors  results  in  a  decrease  in  performance. 
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Figure  3-11:  Speedup  for  small  functions 


Figure  3-12:  Speedup  for  medium  function 


Number  of  threads 


Figure  3-13:  Speedup  for  large  functions 

Figure  3-13  depicts  the  speedup  for  f^  and  f296.  Adding  basic  blocks  helps:  functions  f2(J7  and  f296 
show  almost  linear  speedup  and  for  both  functions  we  do  not  have  a  "local  speedup  maximum",  that  is,  the 
speedup  increases  steadily  with  increasing  number  of  processors. 
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3.7.  Discussion  of  the  observed  speedup 

Our  measurements  show  that  for  functions  that  are  large  enough,  the  speedup  over  sequential  interval 
analysis  is  cons;derable.  These  results  are  particularly  encouraging  considering  the  simplicity  of  our 
implementation:  the  setup  woric  is  performed  sequentially,  the  scheduling  strategy  is  very  simple,  and  we 
were  still  able  to  observe  up  to  7-fold  speedup  using  no  more  than  1 1  processors,  one  of  which  is  dedicated 
to  load  balancing  and  does  not  process  any  nodes  of  the  complete  interval  tree.  For  all  benchmark 
functions,  the  optimal  theoretical  speedup  is  much  higher  than  the  observed  speedup.  For  instance,  a  lower 
bound  of  the  parallel  efficiency  of  the  complete  interval  tree  of  our  smallest  example,  f30,  was  10.  In  other 
words,  when  communication  cost  and  system  overhead  are  ignored,  the  parallel  speedup  should  increase 
for  each  added  processor  up  to  10  processors,  and  it  should  decrease  when  more  *han  10  processors  execute 
in  parallel.  The  observed  speedup  depicted  in  Figure  3-11  shows  that  the  observed  speedup  decreases  when 
more  than  6  processors  operate  in  parallel,  so  implementation  and  system  overhead  as  well  as 
communication  cost  cancel  parallel  speedup  w'  en  too  many  processors  operate  in  parallel.  We  observed 
the  same  phenomenon  for  our  other  benchmark  functions  and  conclude  that  sophisticated  methods  to  find  a 
good  or  even  optimal  schedule  for  the  nodes  in  a  complete  interval  tree  does  not  increase  the  observed 
speedup,  and  a  simple  scheduling  strategy  suffices  for  good  practical  results. 

3.7.1.  Possibilities  to  increase  the  speedup 

There  are  two  possibilities  to  increase  the  effectiveness  of  parallel  interval  analysis.  First,  better  parallel 
speedup  can  be  expected  when  the  size  of  u.e  input  functions  is  increased.  Second,  implementation  and 
system  overhead  decreases  relative  to  the  total  execution  time  when  the  workload  in  the  nodes  of  the 
complete  interval  tree  is  increased. 

One  way  to  increase  the  size  of  program  flow  graphs  is  to  use  procedure  inlining  if  a  program  consists  of 
many  small  functions.  Thus,  even  when  the  size  of  the  flow  graphs  is  small,  our  approach  can  yield  good 
speedup  when  combined  with  inlining,  an  important  optimization  in  its  own  right.  Further,  many  functions 
consist  of  more  basic  blocks  than  our  benchmark  functions,  and  for  such  large  functions,  better  speedup  can 
be  expected. 

Since  we  only  implemented  the  solution  of  one  data  flow  equation,  the  amount  of  woric  performed  at  each 
node  of  the  complete  interval  tree  smaller  than  in  a  typical  global  optimizer.  Since  the  parallel  overhead  of 
parallel  interval  analysis  is  independent  of  the  number  of  data  flow  equations  even  better  speedup  can  be 
expected  if  the  number  of  equations  solved  is  increase d- 

3.7 2.  Application  spectrum  for  parallel  interval  analysis 

The  intervals  of  a  flow  graph  correspond  to  the  loop  structure  of  the  program.  Working  on  (independent) 
intervals  in  parallel  means  that  loops  of  the  same  loop  nesting  level  are  treated  in  parallel.  Other  global 
optimizations  that  allow  to  treat  program  loops  on  the  same  nesting  level  independently  can  be  mapped  into 
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the  same  parallel  framework.  Examples  are  th ;  extension  of  interval  analysis  to  handle  arravs  reported  in 
[GrossSteenkiste  90]  and  other  methods  for  data  dependence  analysis  used  in  vectorizing  compilers.  The 
fact  that  we  parallelized  the  solution  of  an  availability  system  does  not  restrict  our  approach  to  just  data 
flow  equations.  Our  measurements  indicate  that  parallelism  based  on  program  structure  can  yield 
significant  speedup  -  independent  of  the  specific  optimization  problem  that  is  solved  in  parallel. 


3.8.  Related  work 

Other  parallel  algorithms  for  global  data  flow  algorithms  are  reported  in  [LeeMarloweRyder  91]  and 
[GuptaPollockSoffa  90].  The  parallel  hybrid  algorithm  described  in  [LeeMarloweRyder  91]  combines  the 
iterative  technique  with  a  structured  method  for  data  flow  analysis.  Maximal  strongly  connected 
components  in  the  flow  graph  are  processed  in  parallel.  The  iterative  method  is  used  to  compute  local  data 
flow  information  for  each  component,  and  in  a  subsequent  propagation  phase  global  data  flow  information 
is  computed  for  each  component  The  communication  between  parallel  processes  in  this  approach  is 
greater  than  the  communication  required  for  parallel  interval  analysis.  In  the  measurement,  of  the  parallel 
speedup,  the  number  of  processors  that  execute  concurrently  is  varied  between  2  and  8.  The  best  parallel 
speedup  reported  in  that  paper  is  less  than  5. 

Goal  of  the  approach  taken  in  [GuptaPollockSoffa  90]  is  to  partition  a  program  flow  graph  independently 
of  program  structure.  A  program  is  decomposed  into  single-entry-single-exit  regions  that  can  be  smaller 
than  intervals,  therefore  the  size  of  the  parallel  tasks  can  be  chosen  more  evenly.  No  implementation  or 
parallel  speedups  are  reported.  This  method  has  the  potential  to  create  more  parallelism  in  global  data  flow 
analysis,  but  it  is  not  clear  whether  this  parallelism  can  be  taken  advantage  of  in  an  implementation.  Our 
own  measurements  indicate  that  even  the  coarser  grained  parallelism  in  parallel  interval  analysis  is 
cancelled  by  system  and  implementation  parameters  when  too  many  processors  execute  concurrendy. 

3.9.  Chapter  summary 

Our  parallelization  of  interval  analysis  is  based  on  explicit  program  structure.  Data  locality  is  given  by 
the  loop  structure  of  the  input  program  and  can  be  exploited  for  the  parallelization  in  a  straightforward 
manner.  Basis  of  our  parallel  implementation  is  the  complete  interval  tree  which  captures  the  interval 
partitions  of  a  program.  It  is  NP  complete  to  compute  a  lower  bound  on  the  number  of  processors  needed 
to  process  a  given  inpul  program  in  minimal  time,  but  we  gave  some  simple  approximations  of  this  lower 
bound.  Our  implementation  used  a  very  simple  scheduling  algorithm  for  the  nodes  in  the  complete  interval 
tree,  and  we  were  able  to  observe  considerable  parallel  speedup.  Due  to  system  and  implementation 
overhead,  the  speedup  declined  when  too  many  processors  operated  concurrendy.  In  all  cases  the  number 
of  processors  for  which  the  parallel  speedup  declined  was  smaller  than  a  lower  bound  on  the  number  of 
processors  needed  to  process  a  complete  interval  tree  in  minimal  time.  We  conclude  that  a  straightforward 
parallelization  based  on  the  loop  structure  of  a  program  works  well  in  practice,  and  that  complicated 
scheduling  algorithms  or  efforts  to  create  finer  grained  parallelism  do  not  result  in  increased  observed 
speedup. 
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Chapter  4 

Global  register  allocation:  background 


4.1.  Introduction 

In  the  previous  chapter  we  showed  a  simple  parallelization  of  global  data  flow  analysis  that  was  based  on 
explicit  program  structure.  Data  locality  was  given  by  the  loop  structure  of  the  input  program  and  the 
results  for  an  inner  loop  could  be  embedded  into  the  results  for  an  enclosing  loop  without  backtracking.  A 
number  of  global  compiler  optimizations  do  not  fit  into  this  framework  because  in  general  it  is  not  possible 
to  embed  a  partial  solution  of  one  program  construct  into  the  enclosing  loop  or  conditional  without 
backtracking.  We  call  such  compiler  optimizations  unstructured  optimizations.  An  example  of  such  an 
optimization  is  global  register  allocation.  Global  register  allocation  is  the  problem  of  mapping  variables 
that  live  across  basic  block  boundaries  to  machine  registers.  The  problem  of  optimal  global  register 
allocation  is  mathematically  equivalent  to  the  problem  of  finding  an  optimal  coloring  for  an  undirected 
graph  representing  register  conflicts.  Global  register  allocation  via  graph  coloring  is  used  in  a  variety  of 
compilers.  In  general  it  is  assumed  that  the  graph  denoting  the  register  conflicts  of  a  program  is  an 
arbitrary  graph.  Finding  an  optimal  coloring  of  arbitrary  graphs  is  NP  complete. 

Given  a  fixed  number  of  processors,  it  Is  very  difficult  to  come  up  with  a  straightforward  parallelization 
of  an  NP  complete  problem.  The  research  goal  for  the  second  part  of  this  thesis  is  to  investigate  whether 
knowledge  about  program  structure  can  be  used  to  implicitly  guide  the  parallelization  of  global  register 
allocation.  In  other  wo.  we  examine  whether  the  knowledge  about  program  structure  can  be  used  to 
analyze  the  graph  that  represents  register  conflicts  such  that  points  can  be  found  at  which  the  graph  can  be 
partitioned  into  independent  components.  The  partitioning  of  the  conflict  graph  into  independent 
components  is  based  on  clique  separators,  first  introduced  in  [Taijan  85].  The  problem  is  that  for  arbitrary 
graphs  it  is  very  difficult  to  detect  clique  separators. 

The  first  step  of  our  research  is  to  establish  a  connection  between  program  structure  and  register  conflict 
graphs.  We  propose  a  model  in  which  structural  knowledge  about  a  program  is  encoded  in  the  register 
conflict  graph.  We  then  use  this  knowledge  to  deduce  characteristics  of  the  partial  register  conflict  graph 
of  individual  loops  and  conditionals  that  are  later  used  to  detect  clique  separators  in  the  conflict  graph. 

This  chapter  provides  the  background  for  our  model.  We  first  give  some  basic  definitions  and  the 
description  of  our  input  modeL  We  then  introduce  die  standard  method  for  register  allocation  via  graph 
coloring,  followed  by  an  example  of  a  conflict  graph  that  can  not  be  colored  with  the  standard  method.  We 
conclude  the  chapter  by  classifying  the  live  ranges  that  occur  in  loops  and  conditionals. 
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4 2.  Global  register  allocation:  basic  definitions 

Registers  are  a  scarce  resource  on  all  modem  computer  architectures  [Pat/Hen  90].  Optimizations  like 
common  subexpression  elimination,  constant  propagation  etc.  create  temporaries  that  can  be  used  most 
efficiently  if  they  reside  in  registers.  Once  a  compiler  has  decided  which  variables  and  temporaries  are  to 
be  placed  into  registers,  register  allocation  is  the  problem  of  mapping  machine  registers  to  variables  and 
temporaries  such  that  runtime  efficiency  of  the  machine  code  is  maximized.  In  general,  variables  and 
temporaries  are  not  used  in  every  basic  block  of  the  program  flow  graph,  but  have  local  or  global  live 
ranges. 

Definition  1:  (Local  live  range)  A  local  live  range  of  a  variable  v  is  a  sequence  of  instructions 
ij, ...  in  such  that  i2  is  a  definition  of  v,  and  in  is  the  last  use  of  v  before  a  re-definition  of  v. 
Further,  all  instructions  of  that  sequence  occur  in  the  same  basic  block. 

In  general  global  variables  or  temporaries  created  by  global  optimizations  are  live  across  basic  block 
boundaries.  Global  data  flow  analysis  determines  for  each  basic  block  b  which  variables  and  temporaries 
are  live  at  the  entry  to  the  basic  block  [Aho  84].  Similar  to  local  live  ranges  that  consist  of  sequences  of 
instructions,  global  live  ranges  consist  of  sets  of  basic  blocks.  Because  the  exact  order  of  instructic.is 
inside  basic  blocks  is  usually  not  known  at  the  time  of  global  register  allocation,  we  assume  that  global  live 
ranges  extend  throughout  the  instructions  of  every  basic  block  that  is  part  of  the  live  range.  So  in  our 
model,  global  live  ranges  start  and  stop  at  basic  block  boundaries.  Our  definition  of  a  global  live  range  is 
based  on  tie  live  range  graph  of  a  variable,  defined  below. 

Definition  2:  (Live  range  graph  of  a  variable)  Given  the  flow  graph  of  a  program  consisting  of 
basic  blocks  B,  the  live  range  graph  of  variable  a  consists  erf  vertices  V  and  directed  edges  E.  The 
set  of  vertices  of  the  live  range  graph  is  defined  as  ve  Viff  ve  5  and  a  is  live  in  v.  The  edges  of 
the  live  range  graph  are  defined  as  e=(v1,v2)e  E  iff  there  is  an  edge  from  v,  to  v2  in  the  flow 
graph  and  any  definition  of  a  in  v2  is  preceded  by  a  use  of  a  in  v2. 

The  live  range  graph  of  a  variable  is  the  basis  for  our  computation  of  live  ranges: 

Definition  3:  (Live  ranges  of  a  variable)  The  live  ranges  of  a  variab’e  a  are  the  connected 
components  of  the  live  range  graph  of  a. 

In  other  words,  the  live  range  of  a  variable  is  a  contiguous  set  of  basic  blocks  of  the  flow  graph  in  which 
the  variable  lives.  A  variable  can  have  several  independent  live  ranges.  If  variable  a  fives  in  basic  blocks 
bj  and  b2,  and  b1  and  b2  are  not  part  of  the  same  live  range,  the  variable  can  reside  in  different  registers  in 
bj  and  b2  respectively. 

Figure  4-1  shows  both  a  flow  graph  and  the  live  range  graph  for  variable  a. 

Variable  a  is  five  in  basic  blocks  Bl,  B2,  B4,  B5,  the  split  node  and  B6.  The  live  range  graph  for  a  is 
depicted  to  the  right  Because  there  is  a  definition  of  a  in  B4  that  is  not  {seceded  by  a  use  of  a,  there  is  no 
edge  from  B2  to  B4  in  the  five  range  graph.  For  the  same  reason,  there  is  no  edge  between  B4  and  B5  in  the 
live  range  graph.  The  connected  components  of  the  live  range  graph  are  (B4),  (B5E6). 

Therefore,  a  has  three  distinct  live  ranges.  The  live  range  that  consists  of  B4  is  a  local  five  range,  the  other 
two  are  global  live  ranges. 
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Figure  4-1:  Sample  live  range 


Local  live  ranges  and  global  live  ranges  are  mapped  to  machine  registers  in  different  phases  -  local 
register  allocation  and  global  register  allocation  respectively.  Sometimes,  local  register  allocation  is 
performed  fairly  late  in  the  compilation  process,  because  some  instruction  reordering  inside  basic  blocks 
might  be  done  by  the  code  scheduler.  Further,  some  compilers  assume  dedicated  registers  for  global  live 
ranges. 

One  major  difference  between  local  register  allocation  and  global  register  allocation  is  that  local  register 
allocation  is  performed  for  live  ranges  that  consist  of  instructions  that  form  straight  line  code.  We  will  see 
that  optimal  global  register  allocation  can  be  solved  efficiently  for  straight  line  code.  Local  register 
allocation  is  often  done  at  code  selection  time,  and  is  in  that  case  a  non  trivial  task. 

Because  global  live  ranges  consist  of  sets  of  basic  blocks  that  may  be  part  of  loops  or  conditionals,  the 
register  conflict  graphs  for  global  live  ranges  are  more  complex,  and  finding  an  optimal  coloring  for  those 
graphs  is  NP  complete. 

The  research  focus  of  this  thesis  is  on  global  register  allocation.  Before  we  introduce  our  model  of 
structural  global  register  allocation,  we  introduce  register  conflict  graphs  and  the  standard  method  for 
global  register  allocation  via  graph  coloring. 
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42.1.  Standard  method  for  global  register  allocation  by  graph  coloring 

The  standard  method  for  global  register  allocation  via  graph  coloring  operates  on  the  register  conflict 
graph.  Given  a  flow  graph  and  for  each  basic  block  in  the  flow  graph  the  set  of  variables  and  temporaries 
that  are  live  in  the  basic  block,  the  standard  register  conflict  graph  is  defined  as  follows: 

Definition  4:  (Standard  register  conflict  graph)  A  standard  register  conflict  graph  G=(V£) 
consists  of  vertices  N  and  edges  E.  Each  n  e  N  corresponds  to  a  global  live  range.  Two  vertices 
nJt  n2  are  connected  by  an  edge  e  iff  there  is  at  least  one  basic  block  b  such  that  b  is  part  of  both 
live  ranges  n2  and  n2. 

In  other  words,  n2  and  n2  may  co-exist  during  program  execution  and  hence  must  be  placed  in 
different  registers. 

Definition  5:  (Chromatic  number  of  a  graph )  The  chromatic  number  of  a  graph  G  is  the 
minimal  number  of  colors  needed  to  color  all  nodes  in  G  such  that  no  two  adjacent  nodes  have 
the  same  color. 

The  most  commonly  used  method  to  color  register  conflict  graphs  was  introduced  in  [Chaitin  81]  and 
works  as  follows.  Let  k  be  the  number  of  available  machine  registers  and  G  be  the  register  conflict  graph. 
A  ^-coloring  of  G  based  on  the  following  observation:  if  G  has  a  node  n  with  less  than  k  adjacent  nodes,  G 
is  Jfc-colorable  iff  the  graph  G'  obtained  by  removing  n  and  all  edges  incident  on  n  is  ^-colorable.  Thus,  all 
no'Vis  of  degree  less  than  k  are  removed  from  G  until  G’  is  empty  or  consists  only  of  nodes  with  degree 
greater  than  k. 

If  G’  is  not  empty  and  no  nodes  can  be  removed  from  G\  the  standard  method  assumes  that  no  ^-coloring 
exists  for  G,  and  that  therefore  not  all  live  ranges  can  reside  in  registers  at  all  times.  The  basic  idea  to  solve 
this  problem  is  to  remove  nodes  from  G'  and  assume  that  the  corresponding  live  ranges  are  stored  in 
memo  ather  than  register ,  The  node  removal  process  is  repeated  on  the  altered  graph  until  a  ^-coloring 
is  found. 

This  algorithm  is  used  for  many  implementations  of  global  register  allocation  [KilLar  86,  Wall 
86,  ChowHen  90].  In  the  following  we  will  refer  to  this  coloring  method  as  the  node  removal  technique. 

4 22.  Shortcomings  of  the  node  removal  technique 

In  the  standard  node  removal  technique,  the  decision  of  which  node  is  removed  next  is  guided  by  the 
number  of  outgoing  edges.  Two  nodes  n2  and  n2  that  are  removed  from  the  graph  in  sequence  might 
correspond  to  live  ranges  of  unrelated  parts  of  the  flow  graph  -  the  program  structure  does  not  play  a  role  in 
the  overall  coloring  process.  Further,  there  exist  register  conflict  graphs  that  are  /t-colorable,  but  in  which 
each  node  has  at  least  k  neighbors,  so  the  node  removal  technique  is  not  able  to  produce  a  ^-coloring.  See 
for  instance  the  graph  depicted  in  Figure  4-2. 

Figure  4-2  shows  a  register  conflict  graph  in  which  each  node  has  at  least  3  neighbors.  Given  the  number 
of  available  register  is  3,  the  node  removal  technique  decides  that  at  least  one  of  the  nodes  must  be  spilled 
to  memory.  As  shown  in  the  Figure,  a  3-coloring  for  the  graph  exists. 
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Figure  4-2:  Example  where  standard  method  fails  to  come  up  with  a 

^-coloring 

In  our  approach  to  global  register  allocation,  we  use  knowledge  about  the  program  structure  to  guide  the 
register  allocation  process.  We  will  see  later  that  there  are  some  cases  in  which  the  structural  knowledge 
encoded  in  the  register  conflict  graph  can  be  used  to  produce  ^-colorings  where  the  node  removal  technique 
is  unable  to  do  so. 

We  now  discuss  our  classification  of  live  range  as  they  occur  in  straight  line  code,  loops  and  conditionals. 
Based  on  these  classifications,  we  show  some  properties  of  register  conflict  graphs  for  each  programming 
construct  in  the  following  chapters. 

4 23.  Continuous  and  broken  live  ranges 

We  partition  the  live  ranges  of  variables  Luo  two  groups:  continuous  and  broken  live  ranges. 

Definition  6:  (Broken  live  range)  Given  a  vaiiable  v  with  live  range  l  that  consists  of  basic 
blocks  we  say  that  l  is  broken  iff  there  are  two  basic  blocks  bi  and  bj  both  in  {bv...J)n} 

that  meet  the  following  condition:  There  is  a  backarc  free  path  from  bt  to  bj  that  contains  a  basic 
block  bk  such  that  bk  contains  a  definition  of  v  and  every  use  of  v  in  bk  is  preceded  by  that 
definition. 


Definition  7:  (Continuous  live  range)  A  live  range  is  continuous  iff  it  is  not  broken. 

Examples  of  broken  and  a  continuous  live  ranges  are  given  in  Figure  4-3.  In  the  loop  labeled  A,  variable  v 
lives  in  basic  blocks  1,3  and  4.  Because  there  is  a  backarc  free  path  between  blocks  1  and  3  and  3  contains 
a  definition  of  v,  the  live  range  for  v  is  broken.  The  conditional  labeled  B  also  contains  a  broken  live  range 
for  variable  v  -  there  is  a  backarc  free  path  from  block  1  to  block  3  and  there  is  a  definition  of  v  in  block  3. 
Hence,  the  live  range  for  v  is  broken.  The  live  range  of  v  depicted  in  the  flow  graph  labeled  C  is 
continuous.  Even  though  the  live  range  of  v  in  the  flow  graph  labeled  D  contains  all  basic  blocks  of  that 
flow  graph  it  is  a  broken  live  range  because  the  backarc  free  path  between  blocks  2  and  4  contains  block  3 
which  contains  a  definition  of  v. 


Intuitively,  broken  live  ranges  can  contain  "holes"  -  holes  in  a  broken  live  range  consist  of  basic  blocks 
that  are  not  in  the  live  range  but  "between"  basic  blocks  that  form  the  live  range.  More  formally,  the 
definition  of  a  hole  is  based  on  an  undirected  graph  which  we  call  hole  graph. 


Figure  4-3:  Examples  of  broken  and  continuous  live  ranges 

Definition  8:  (Hole  graph  of  a  live  range)  Given  a  flow  graph  F  with  basic  blocks  B  and  a  live 
range  /,  the  hole  graph  of  l  is  the  graph  induced  by  the  set  of  nodes  {£-/}  and  all  edges  of  F 
except  F’s  backedges. 

In  other  words,  the  hole  graph  of  a  live  range  l  is  derived  from  the  flow  graph  by  removing  all  basic  blocks 
that  form  /,  all  edges  incident  on  basic  blocks  in  l  and  all  backarcs.  The  connected  components  of  a  hole 
graph  of  a  live  range  are  called  holes  of  that  live  range  provided  these  connected  components  are  "between" 
basic  blocks  that  are  part  of  the  live  range.  In  other  words,  there  must  be  a  backarc  free  path  from  the  live 

range  that  leads  into  the  hole,  and  there  must  be  a  backarc  free  path  from  the  hole  that  leads  into  the  live 

range.  More  formally: 

Definition  9:  (Hole  of  a  live  range)  Given  a  flow  graph  F  and  a  live  range  /  consisting  of  basic 
blocks  {bx,...Jbn},  leg  //  be  the  hole  graph  of  /.  A  connected  component  c  of  H  is  called  a  hole  of 

/  iff  3bc  G  c  such  dial 

1. 3i(e  {bv...J)n}  such  that  there  is  a  backarc  free  path  from  bt  to  bc 
2. 3bj  e  [bv...J?n]  such  that  there  is  a  backarc  free  path  from  bc  to  bj 
Figure  4-4  gives  an  example  of  a  broken  live  range  and  its  hole  graph.  In  that  example,  the  flow  graph  to 
the  left  consists  of  a  loop.  Variable  a  has  a  broken  live  range  consisting  of  basic  blocks  {1,2,7 ,8}.  The 
hole  graph  is  derived  from  the  loop  by  removing  those  basic  blocks  and  all  backedges  as  well  as  edges  to 
and  from  {1,2,7 ,8}.  The  hole  graph  consists  of  one  connected  component.  Hence,  the  live  range  for  a  has 
only  one  hole,  which  consists  of  all  basic  blocks  in  the  hole  graph. 

Some  broken  live  ranges  do  not  contain  any  holes  -  even  though  there  is  a  re-definition  in  some  basic 
block  that  is  part  of  the  broken  live  range,  the  hole  graph  of  that  live  range  can  be  empty.  In  that  case,  the 
broken  live  range  can  be  treated  as  if  it  were  continuous.  We  call  such  live  ranges  continuous  equivalent. 
The  live  range  depicted  in  the  flow  graph  labeled  D  in  Figure  4-3  is  an  example  of  a  continuous  equivalent 
live  range. 

Definition  10:  (Continuous  equivalent  live  range)  We  say  that  a  broken  live  range  /  is 
continuous  equivalent  iff !  contains  no  holes. 

One  nice  property  of  continuous  equivalent  live  ranges  is  that  for  each  continuous  equivalent  live  range 
there  exists  an  equivalent  continuous  live  range.  What  sounds  like  a  play  of  words  will  be  formally  shown 
in  the  next  lemma. 
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Figure  4-4:  Hole  graph  and  hole  of  a  broken  live  range 

Lemma  11:  Given  a  register  conflict  graph  G  that  contains  a  continuous  equivalent  live  range  / 
for  a  variable  v,  let  .,bm)  be  the  set  of  basic  blocks  that  contain  a  definition  of  v  that  is  not 
preceded  by  a  use  of  v  in  the  same  basic  block.  The  live  range  does  not  change  if  a  use  of  v  is 
inserted  into  every  b-e  [bx,...,bm). 

Proof:  Because  /  contains  no  holes,  no  backarc  free  path  between  any  two  basic  blocks  in  /  can 
contain  a  basic  block  that  is  not  part  of  /.  In  particular,  every  path  to  a  basic  block  that  contains  a 
definition  of  v  consists  of  basic  blocks  that  lie  entirely  in  /.  Hence,  adding  a  use  of  v  at  the  top  of 
the  basic  blocks  that  define  v  is  not  going  to  change  /. 

By  adding  uses  of  v  to  the  defining  blocks  that  are  part  of  a  contiguous  equivalent  live  range  for  variable 
v,  the  broken  live  range  is  changed  into  a  continuous  live  range.  Hence,  for  the  purpose  of  register 
allocation,  continuous  equivalent  live  ranges  can  be  turned  into  continuous  live  ranges  by  modifying  the 
program  slightly.  Like  broken  live  ranges  continuous  equivalent  live  ranges  are  caused  by  re-definitions  in 
loops  or  conditionals  and  it  must  be  ensured  that  both  parts  of  a  continuous  equivalent  live  range  end  up  in 
the  same  register.  Hence  changing  a  continuous  equivalent  live  range  into  an  equivalent  continuous  live 
range  has  no  influence  on  the  subsequent  graph  coloring. 

A  program  that  consists  entirely  of  straight  line  code  can  only  contain  continuous  live  ranges,  stated 
formally  in  the  next  lemma. 

Lemma  12:  Given  a  piece  of  straight  line  code  consisting  of  basic  blocks  {bx,...J>n),  let  /  be  a 
live  range  that  consists  of  basic  blocks  that  form  a  subset  of  { bv...Jbn }.  Then,  /  is  continuous. 

Proof:  Given  a  piece  of  straight  line  code  {bj,...^n},  there  is  at  most  one  path  between  any  two 
blocks  biybj£  [bx,...J)n].  Given  a  live  range  /  for  a  variable  v,  we  show  by  contradiction  that  / 
must  be  continuous. 

Let  /  be  a  broken  live  range  for  variable  v  consisting  of  basic  blocks  that  form  a  subset  of 
{bj,...^n}.  By  definition  of  broken  live  ranges,  there  must  be  two  basic  blocks  bi  and  b]  that  are 
part  of  /,  and  there  must  be  a  path  from  bt  to  bj  that  contains  a  basic  block  bk  that  contains  a 
definition  of  v  which  is  not  preceded  by  a  use  of  v  in  By  definition  of  live  range  graphs,  there 
can  be  no  edge  between  bk  and  a  basic  block  preceding  b^  Because  the  flow  graph  contains  no 
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brckarcs,  bk  :an  not  be  part  of  the  same  connected  component  as  bt,  smce  bi  must  precede  b^ 
Because  bk  precedes  bj,  bj  can  not  be  part  of  the  same  connected  component  as  bi  -  a 
contradiction.  Therefore,  /  must  be  continuous. 

By  Lemma  12  we  know  that  in  well  structured  programs  only  loops  and  conditionals  can  contain  re¬ 
definitions  that  cause  a  hole  in  a  live  range.  Holes  in  a  live  range  for  a  variable  v  occur  if  several 
definitions  of  v  are  used  in  the  same  basic  block.  A  loop  can  contain  a  re-definition  of  a  variable  v  at  the 
bottom  of  the  loop  that  is  used  at  the  top  of  the  loop.  A  hole  that  is  caused  by  such  a  re-definition  consists 
of  the  basic  blocks  "between"  the  basic  block  that  contains  the  use  and  the  basic  block  that  contains  the 
re-definition.  An  example  of  this  situation  is  given  in  Figure  4-5.  The  figure  shows  both  a  nested  loop  and 


Figi1  re  4-5:  Hole  caused  by  a  re-definition  inside  a  loop 
the  interval  representation  of  the  register  conflict  graph.  The  live  range  for  variable  a  contains  a  hole 
consisting  of  basic  blocks  {3,4 ,5,6} .  The  hole  of  a  is  caused  by  a  re-definition  of  a  in  basic  block  7,  which 
is  part  of  the  innermost  loop.  Basic  block  2,  which  uses  that  re-definition  of  a,  is  part  of  the  outer  loop,  but 
not  of  the  inner  loop.  Basic  blocks  between  the  use  and  the  re-definition  are  {3,4,5 ,6}  -  the  basic  blocks 
that  form  the  hole. 

A  conditional  can  contain  one  branch  clause  c  with  a  re-definition  of  a  variable  v  with  that  is  live  in  both 
the  split  node  and  the  join  node  of  the  conditional.  Hence,  there  must  be  a  definition  of  v  in  a  basic  block 
that  occurs  on  the  path  from  the  program  entry  to  the  split  node  and  a  use  of  v  in  a  basic  block  b  that  is 
dominated  by  the  join  node.  Hence,  both  definitions  of  v  in  b  and  in  the  branch  clause  c  are  used  outside 
the  conditional  An  example  of  a  conditional  that  contains  a  re-definition  of  such  a  live  range  is  shown  in 
Figure  4-6. 

The  live  range  for  variable  g  consists  of  basic  blocks  {1,2, 5, 6,7}.  The  hole  graph  for  g  consists  of  basic 
blocks  3  and  4  and  is  depicted  to  the  right.  Because  there  is  a  backarc  free  path  from  basic  block  2  to  basic 
block  3  and  there  is  a  backarc  free  path  from  basic  block  4  to  basic  block  7,  blocks  {3,4}  constitute  a  hole 
of  the  live  range  for  g.  That  hole  is  caused  by  the  re-definition  of  g  in  basic  block  5  which  is  part  of  one 
branch  clause.  Both  definitions  of  g  in  basic  blocks  1  and  5  are  used  in  basic  block  7. 
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Conditional  with  BGLOBAL  liv«  rmngo  lor  g 

Figure  4-6:  Hole  caused  by  re-definition  of  a  variable  with  BGLOBAL  live 

range  in  a  conditional 

A  hole  of  a  live  range  can  share  basic  blocks  with  more  than  one  programming  construct  In  the  example 
depicted  in  Figure  4-6,  the  hole  of  variable  g’s  live  range  contains  basic  blocks  {3,4}  which  are  both  part  of 
the  loop  consisting  of  basic  blocks  {3,4,5}  and  of  the  entire  conditional.  Given  the  live  range  for  g, 
{ 1,2,5 ,6,7},  the  fact  that  the  basic  blocks  "between"  2  and  5  are  missing  makes  {3,4}  a  hole.  Looking  at 
the  loop  in  isolation,  the  live  range  of  g  consists  only  of  basic  block  5  and  therefore  basic  blocks  3  and  4  are 
not  recognized  as  a  hole  of  g's  live  range.  Considering  the  conditional  in  isolation  (that  is,  basic  blocks 
{23,4,5,6,7}),  the  live  range  of  g  consists  of  basic  blocks  {23,6,7}.  Note  that  the  hole  of  g's  live  range 
exists  in  the  conditional,  even  when  the  rest  of  the  flow  graph  is  ignored.  Hence,  the  hole  of  the  live  range 
for  g  can  be  "linked"  to  the  conditional  but  not  the  loop  nested  it  *  ;de  the  conditional. 

The  reason  why  wc  want  to  link  a  hole  to  a  particular  loop  or  conditional  is  that  the  exact  position  of  the 
hole  in  the  program  flow  graph  determines  which  simplifications  can  be  carried  out  on  the  flow  graph  such 
that  the  register  conflict  graph  is  unchanged  by  those  simplifications. 

Given  the  set  of  loops  and  conditionals  in  a  well  structured  program  that  share  basic  blocks  with  a  given 
hole  h  for  a  broken  live  range  l,  we  link  h  to  the  innermost  programming  construct  that  contains  a  subset  of 
l  such  that  A  is  a  hole  of  this  subset,  In  our  previous  example  (Figure  4-6),  the  conditional  is  the  innermost 
programming  construct  that  contains  a  subset  g'  of  the  live  range  for  g  such  that  g’  is  broken  and  has  the 
same  hole  as  g.  Now  more  formally: 

Definition  13:  (Linking  of  a  hole)  Given  a  well  structured  flow  graph  F  that  contains  a  broken 
live  range  l=[lv...,ln)  that  contains  a  hole  h=(hx,...Jih),  let  P  be  the  set  of  loops  and 
conditionals  in  F  such  that  each  loop  or  conditional  in  P  contains  at  least  one  basic  block  in 
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f  hx hh).  For  pe  P  let  l =lnp,  that  is  the  intersection  of  the  basic  blocks  that  form  p  and  /.  We 

say  that  h  is  linked  to  p  iff 

1.  h  is  a  hole  of  lp 

2.  for  every  loop  or  conditional  p'  nested  inside  of  p,  h  is  not  a  hole  of  p' . 

In  our  example  depicted  in  Figure  4-5,  the  hole  of  variable  a  is  linked  to  the  outer  loop,  and  in  the  example 
depicted  in  Figure  4-6,  the  hole  of  the  live  range  for  g  is  linked  to  the  outer  conditional. 

4 3.  Live  ranges  in  loops 

The  intuitive  background  of  our  classification  of  live  ranges  in  loops  is  the  impact  they  have  on  register 
conflict  graphs  for  loops.  We  partition  the  live  ranges  in  loops  into  backarc  2nd  forward  live  ranges 

More  formally: 

Definition  14:  ( Forward  live  range)  A  global  live  range  b} . bn  of  a  variable  v  is  called  a 

forward  live  range  of  a  loop  consisting  of  basic  blocks  {/j . /m}  iff 

1.  [bl,...J>n}n(lv...,lm)*0 

2.  all  paths  from  basic  blocks  b^f  e  [bl,...J}n}n{lv..Jm]  that  contain  a  definition  of  v  to 

buse  e  {bv-<bn)nV\ . U  11131  use 11,6  definition  in  b^are  backarc  free. 

The  complement  of  forward  live  ranges  are  backarc  live  ranges,  defined  as  follows: 

Definition  15:  (Backarc  live  range)  A  global  live  range  b]t...,bn  of  a  variable  v  is  called  a 
backarc  live  range  of  a  loop  consisting  of  basic  blocks  {/j . lm]  iff 

1.  {bv...,bn}n{ll,...,lm}*0 

2.  at  least  one  path  from  a  basic  block  b^  e  { bx,...Jbn}r\(lv..Jm }  that  contains  a 

definition  of  v  to  b^  e  (bl,...,bjn(l, . IJ  that  uses  the  definition  in  ^contains  a 

backarc. 

Figure  4-7  depicts  loop  constructs  with  a  forward  live  range  and  a  backarc  live  range  respectively.  The 
live  range  for  a  is  a  forward  live  range,  because  it  is  defined  in  Bl,  and  there  is  no  backarc  on  the  path  from 
B1  to  B2,  where  a  is  used.  The  live  range  for  b  consisting  of  Bl,  B2,  B3  is  a  backarc  live  range,  because 
both  uses  of  b  in  Bl  and  B2  can  only  be  reached  via  a  backarc  from  the  basic  block  that  contains  tlie 
definition  of  b.  Note  that  the  live  range  for  b  is  continuous  -  it  contains  all  basic  blocks  that  form  the  loop. 

The  backarc  live  range  depicted  in  Figure  4-8  is  not  continuous.  The  live  range  for  c  consists  of  blocks 
B4  and  Bl,  and  the  path  from  the  defining  block  B4  to  the  use  in  Bl  contains  a  backarc.  Note  that  the  loop 
contains  two  more  basic  blocks,  B2  and  B3  that  form  a  hole  of  the  live  range  for  c.  Hence,  the  register  for  c 
can  be  used  for  a  different  live  range  that  consists  of  B2  and  B3.  The  live  range  for  c  contains  a  hole  that 
consists  of  basic  blocks  B2  and  B3.  Further,  the  hole  in  the  live  range  is  caused  by  a  definition  of  c  in  block 
B4. 

A  third  example  of  a  backarc  live  range  in  a  loop  is  depicted  in  Figure  4-9.  In  that  example,  variable  c’s 
five  range  consists  of  basic  blocks  [BIJ12J34  ,65}  -  a  broken  live  range.  The  hole  of  the  live  range  for  c  is 
linked  to  the  conditional  with  split  node  B2  and  join  node  B5.  Even  though  the  live  range  for  c  is  broken, 
we  call  it  a  Loop-continuous  backarc  live  range,  because  the  hole  is  not  linked  to  the  loop. 
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Figure  4-7:  Examples  of  backarc  and  torwird  livp  ranges 

4V 

f  use  c; 


B2 


B3 


Figure  4-8:  Another  backarc  live  range 

Definition  16:  (Loop-continuous  live  range)  A  live  range  /  of  variable  v  in  a  loop  L  with  head  h 
and  exit  e  is  called  a  loop-continuous  live  range  iff  there  is  no  hole  h  of  /  such  that  h  is  linked  to 
L. 

Definition  17:  ( Loop-broken  backarc  live  range)  A  back&c  live  range  that  is  not  loop 
continuous  is  called  a  loop-broken  backarc  live  range. 

Figure  4-10  shows  examples  of  loop-broken  and  loop  continuous  backarc  live  ranges. 

The  live  range  fix  variable  c  depicted  to  the  left  consists  of  {51,62,54}.  Because  the  hole  consisting  of 
B3  is  linked  to  the  outer  loop,  the  live  range  for  c  is  a  loop-broken  live  range.  Adding  B3  to  the  live  range 
for  c  changes  it  into  a  loop-continuous  live  range,  shown  in  the  middle  of  Figure  4-10.  The  example 
depicted  to  the  right  shows  another  example  of  a  live  range  that  is  loop  broken:  the  hole  consisting  of  32 
and  the  join  node  is  not  entirely  contained  in  the  inner  conditional.  It  is  easy  to  see  that  a  loop-broken  live 
range  must  be  a  backarc  live  range.  A  live  range  that  contains  a  hole  but  is  not  loop-broken  must  be  a 
forward  live  range  -  the  hole  is  linked  to  a  loop  or  a  conditional  nested  inside  the  loop. 
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Figure  4-9:  A  continuous  backarc  live  range 


Figure  4-10:  Examples  of  loop-broken  and  loop-continuous  live  ranges 


The  class  of  register  conflict  graphs  for  loops  that  contain  broken  live  ranges  contains  arbitrary  circular 
arc  graphs,  for  which  optimal  coloring  is  NP  complete,  whioi  will  be  shown  in  Chapter  5.  We  will  see  that 
only  loops  that  contain  broken  live  ranges  can  cause  arbitrary  register  conflict  graphs.  This  is  not  generally 
the  case  for  conditionals.  The  types  of  live  ranges  in  conditional  branches  are  discussed  next. 
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4.4.  Live  ranges  in  conditionals 

We  partition  the  live  ranges  that  occur  in  conditionals  into  four  classes,  LOEN,  LOEX,  BLOCAL  and 
BGLOBAL.  The  basis  for  this  classification  is  the  presence  or  absence  of  the  split  and  join  node  in  the  live 
ranges.  The  intuitive  reason  for  our  classification  is  that  it  permits  to  detect  dependencies  between  the 
register  conflict  graphs  of  individual  branch  clauses. 

If  each  live  range  in  a  conditional  is  "local"  to  one  disr  branch  clause,  the  conflict  graphs  for  the 
branch  clauses  aie  independent  Therefore,  for  each  i  clause  the  register  conflict  graph  can  be 

colored  independently.  On  the  other  hand,  live  ranges  that  contain  the  split  or  join  node  of  a  conditional  are 
shared  by  the  register  conflict  graphs  of  individual  branch  clauses  and  they  are  no  longer  independent.  We 
will  see  in  Chapter  5  that  dependencies  between  the  conflict  graphs  of  individual  branch  clauses  can  lead  to 
overall  conflict  graphs  that  are  hard  to  color  optimally. 

We  formalize  these  characterizations  and  partition  live  ranges  that  occur  in  a  branch  construct  into  four 
classes: 

Definition  18:  (LOEN  live  range)  A  live  range  is  called  a  LOEN  (Life  On  ENtiy)  live  range  in 
a  conditional  with  split  node  S  and  join  node  J  iff  it  contains  at  least  one  basic  block  that  is  part  of 
a  branch  clause  and  S  but  not  J. 

Definition  19:  (LOEX  live  range)  A  live  range  is  called  a  LOEX  (Live  On  EXit)  live  range  in  a 
conditional  with  split  node  S  and  join  node  J  iff  it  contains  at  least  one  basic  block  that  is  part  of  a 
branch  clause  and  J  but  not  S. 

Definition  20:  (BGLOBAL  live  range)  A  live  range  is  called  a  BGLOBAL  live  range  in  a 
conditional  with  split  node  S  and  join  node  J  iff  it  contains  at  least  one  basic  block  that  is  part  of  a 
branch  clause  and  both  S  and  J. 

Definition  21:  (BLOCAL  live  range)  A  live  range  is  called  a  BLOCAL  live  range  in  a 
conditional  with  split  node  S  and  join  node  J  iff  it  contains  basic  blocks  of  a  branch  clause  but  not 
S  a  J. 

In  the  example  depicted  in  Figure  4-11,  g  is  BGLOBAL,  a  is  LOEN,  d  is  LOEX  and  both  b  and  c  are 
BLOCAL. 

Analogous  to  our  definition  of  loop-continuous  and  loop-broken  live  ranges,  there  are  conditional- 
continuous  and  conditional-broken  live  ranges.  A  conditional-continuous  live  range  can  have  a  hole  that  is 
not  linked  to  the  conditional  itself,  but  instead  to  a  conditional  or  loop  nested  inside.  This  is  depicted  in 
Figure  4-12. 

The  figure  shows  two  nested  conditionals.  The  outer  conditional  with  split  node  1  and  join  node  6 
contains  a  BGLOBAL  live  range  for  v,  which  consists  of  {1,2,4,5,6}.  This  live  range  is  broken,  and 
contains  a  hole  formed  by  block  3.  This  hole  is  caused  by  a  re-definition  of  v  in  basic  block  4,  which  is  part 
of  the  inner  conditional  with  split  node  2  and  join  node  5.  Like  for  loops,  the  live  range  for  v  is  therefore 
conditional  continuous  in  the  outer  conditional,  and  conditional-broken  in  the  inner  conditional.  A  formal 
definition  follows. 

Definition  22:  (Conditional-continuous  live  range)  A  live  range  l  for  a  variable  v  with  a  hole  h 
is  conditional-continuous  in  a  conditional  C  with  split  node  5  and  join  node  J  iff  there  is  no  hole 
h  of  l  such  that  h  is  linked  to  C. 

The  complement  to  a  conditional-continuous  live  range  is  a  conditional  broken  live  range. 
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Conditional  branch 


Figure  4-11:  Types  of  live  ranges  in  a  conditional 


Figure  4-12:  Conditional-continuous  and  conditional-broken  live  range 

Definition  23:  ( Conditional-broken  live  range)  A  live  range  /  for  a  variable  v  is 
conditional-broken  in  a  conditional  C  iff  it  is  not  conditional-continuous  in  C. 
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Note  that  a  conditional  -broken  live  range  must  be  a  BGLOBAL  live  range.  Broken  LOEN ,  LOEX  or 
BLOCAL  live  ranges  in  a  conditional  C  must  be  conditional-continuous  in  C  -  holes  in  such  live  ranges 
must  be  entirely  contained  in  programming  constructs  nested  inside  C. 


4.5.  Chapter  summary 

We  have  classified  the  live  ranges  of  well  structured  programs.  Live  ranges  can  be  continuous  or  broken. 
Live  ranges  in  straight  line  code  are  all  continuous,  while  conditionals  and  loops  can  contain  re-definitions 
that  cause  a  live  range  to  be  broken.  Loops  can  contain  forward  live  ranges  and  backarc  live  ranges.  Live 
ranges  in  loops  can  be  loop-continuous  or  loop-broken.  Conditionals  can  contain  BLOCAL ,  BGLOBAL , 
LOEN  and  LOEX  live  ranges.  A  live  range  l  in  a  conditional  can  be  conditional-continuous  or  conditional- 
broken,  depending  on  which  programming  construct  causes  a  hole  in  l. 

In  the  next  chapters  we  show  how  this  classification  can  be  used  to  analyze  the  shape  of  register  conflict 
graphs  for  loops  and  conditionals. 
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Chapter  5 

Register  conflict  graphs  for  compound  programming  constructs 

In  this  chapter  we  characterize  properties  of  register  conflict  graphs  of  straight  line  code,  loops  and 
conditional  branches.  The  first  part  of  this  chapter  is  devoted  to  re-stating  some  properties  of  register 
conflict  graphs  that  have  been  established  before  [Gol  85,  Fishbum  85,  Bernstein  et  al  89].  We  first 
showing  that  register  conflict  graphs  for  straight  line  code  are  interval  graphs  and  show  that  the  standard 
node  removal  technique  is  able  to  produce  an  optimal  coloring  for  interval  graphs. 

The  class  of  register  conflict  graphs  for  loops  and  conditionals  contain  arbitrary  circular  arc  graphs  for 
which  finding  an  optimal  coloring  is  NP  complete.  When  certain  restrictions  are  met  by  the  live  ranges  that 
occur  in  a  conditional  or  a  loop,  the  register  conflict  graph  of  that  conditional  or  loop  is  equivalent  to  the 
register  conflict  graph  of  straight  line  code.  One  contribution  of  our  structured  model  for  global  register 
allocation  is  that  it  enables  us  to  establish  such  restrictions  systematically.  The  bulk  of  this  chapter  is 
devoted  to  the  description  of  situations  in  which  it  is  possible  to  produce  an  optimal  coloring  for  register 
conflict  graphs  of  loops  and  conditionals  in  polynomial  time.  In  particular  we  discuss  cases  in  which  it  is 
possible  to  simplify  conditionals  and  loops  without  altering  the  register  conflict  graph.  Goal  is  to  create 
straight  line  code  that  is  equivalent  to  loops  and  conditionals  for  the  purpose  of  register  allocation.  We  will 
see  that  the  simplifications  to  obtain  straight  line  code  enable  us  to  locate  clique  separators  in  the  register 
conflict  graph. 

5.1.  Register  conflict  graphs  for  straight  line  code 

We  first  turn  our  attention  to  register  conflict  graphs  of  straight  line  code.  We  show  how  the  live  ranges 
in  a  register  conflict  graph  for  straight  line  code  can  be  mapped  to  intervals  on  the  real  line.  We  then  show 
how  the  cliques  in  the  register  conflict  graphs  can  be  ordered  to  form  a  sequence  of  cliques,  and  use  that  to 
show  that  the  node  removal  technique  is  able  to  produce  an  optimal  coloring  in  polynomial  time. 

We  will  show  that  register  conflict  graphs  of  straight  line  code  are  interval  graphs  by  mapping  every  live 
range  in  the  flow  graph  of  straight  line  code  to  an  interval  on  the  real  line.  The  first  step  is  to  define  a 
monotonously  increasing  function/:  {bj,...^n}  —»R  as  follows:  J[bt)=i,  ie  {l,..,n}.  Note  that /is  both 
monotonely  increasing  and  bijective. 

Definition  1:  (Interval  on  the  real  line)  An  interval  on  the  real  line  <  n2>  is  a  subset 

of  R  such  that  nj6  R  and  ^ e  R  and  [n,^]  and  V{yly<n,}  ye  [nx,n$  and 

V{y|y>n2)  yefn,^]. 
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Definition  2:  ( Interval  graph)  An  interval  graph  is  a  graph  whose  vertices  v  can  be  represented 
by  intervals  /v  of  the  real  line  such  that  two  vertices  are  adjacent  if  and  only  if  the  corresponding 
intervals  intersect 

Because  the  live  ranges  in  straight  line  code  are  contiguous  sets  of  basic  blocks,  we  can  map  a  straight 
line  live  range  [bi,...Jbj]  :o  an  interval  on  the  real  line  by  the  function  g  defined  as  follows: 

Given  a  register  conflict  graph  G  for  straight  line  code,  there  is  an  edge  between  two  live  ranges  iff  their 
intersection  is  not  empty.  Let  G'  be  the  graph  derived  from  G  by  applying  the  function  g  to  each  node  in 
G.  Then  G'  is  an  interval  graph,  and  a  coloring  for  G'  can  be  used  to  color  G  as  well.  Interval  graphs  can 
be  colored  optimally  in  polynomial  time  [Gavril  72]  and  hence  an  optimal  global  register  allocation  can  be 
found  in  polynomial  time  for  straight  line  code. 

Figure  5-1  demonstrates  how  live  ranges  are  mapped  to  intervals.  Four  live  ranges,  for  variables  a,  b,  c 
and  d  are  shown.  The  live  range  for  a  consists  of  basics  blocks  (B1 JB2J33  J35J36J37  J38]  and  is  mapped 
to  the  interval  [1,8]  etc.  The  real  line  is  shown  on  to  the  left  The  corresponding  interval  graph  is  shown  to 
the  right  of  the  figure. 


a 


Uv*  range* 

Figure  5-1:  Register  conflict  graph  for  straight  line  code 

One  nice  property  of  interval  graphs  is  that  it  is  easy  to  identify  all  the  cliques  they  contain.  In  the  next 
section  we  will  show  how  all  cliques  in  an  interval  graph  are  identified  and  how  an  order  on  all  cliqu  s  is 
defined. 

Definition  3:  (Clique)  A  clique  in  a  graph  is  a  subset  of  vertices  such  that  every  pair  of  distinct 
vertices  in  that  subset  is  connected  by  an  edge. 
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It  is  easy  to  see  that  each  node  in  a  clique  in  a  register  conflict  graph  must  be  colored  with  a  different 
color.  The  horizontal  lines  in  the  left  part  of  Figure  5-1  are  drawn  at  the  end  points  of  the  intervals  that 
correspond  to  the  live  ranges.  Live  ranges  that  intersect  the  same  horizontal  line  form  a  clique  in  the 
corresponding  interval  graph.  A  new  clique  starts  at  each  end  point  of  an  interval,  shown  as  horizontal 
lines  in  Figure  5-1.  In  the  following  let  n  be  the  number  of  distinct  live  ranges.  Then,  the  number  of 
pairwise  distinct  end  points  of  intervals  is  at  most  2n.  This  "sequence"  of  horizontal  lines  is  the  idea  of 
ordering  the  cliques  in  an  interval  graph,  which  is  formalized  in  the  next  paragraphs. 

We  first  give  a  slightly  modified  definition  for  cliques  that  allows  us  to  map  cliques  to  the  endpoints  of 
the  intervals  corresponding  to  live  ranges  in  the  clique. 

Definition  4:  ([if]  clique)  We  say  that  a  set  of  intervals  S  forms  a  [i/l  clique  iff 
1.  [ij]QS  Vj€  S 

2.3 se  S such  that  [i'f]  is  not  a  subset  of  s  Vf' <  j  or 
3.  3s  e  5  such  that  [if]  is  not  a  subset  of  s  V/  >j. 

In  other  words,  [i  J\  is  the  largest  interval  contained  in  every  element  of  S. 

In  Figure  5-1,  the  interval  [6,8]  is  contained  in  the  intervals  for  live  ranges  a,  b  and  c.  Because  the 
interval  of  live  range  a  ends  at  8,  and  the  interval  of  live  range  c  starts  at  6,  [6,8]  is  the  maximal  interval 
contained  in  all  three  live  ranges.  Hence,  (ajb,c)  is  a  [6,8]  clique.  The  intervals  for  a  and  b  both  contain 
[4,8].  Hence,  [a,b]  is  a  [4,8]  clique. 

We  can  now  define  an  ordering  on  the  [i/|  cliques  of  an  interval  graph  as  follows: 

Definition  5:  (The  .<.  Oder  for  cliques)  Let  [f;,f2]  and  [t3,t4]  be  two  intervals  and  c7  be  a  [ry,r2] 
clique  and  c2  be  a  [t3,t4]  clique.  We  say  that  c}  .<.  c2  iff  <  ty  If  f]  =  t3,  we  say  that  Cj  .=.  c2. 

Example:  In  Figure  5-1,  [aj>)  is  a  [4,8]  clique,  and  {af>£}  is  a  [6,9]  clique.  Hence,  [af>]  .<.  [a,b,c]- 
The .  < .  order  can  be  used  io  order  sets  of  cliques,  more  formally: 

Definition  6:  (Clique  sequence)  Let  clique ^clique 2,..., clique n  be  a  list  of  ij  cliques  of  an 
interval  graph.  We  say  that  this  enumeration  of  cliques  is  a  sequence  of  cliques  iff 
Vkje  [l,...^i]  k<l  -* clique k.<,. clique t. 

We  have  seen  that  register  interference  graphs  for  straight  line  code  are  interval  graphs,  because  each  live 
range  ran  be  mapped  to  an  interval  on  the  real  line. 

We  will  now  show  how  we  can  obtain  a  sequence  of  cliques  (Definition  6)  from  the  interval  graph  that 
denotes  live  ange  conflicts  of  straight  line  code.  The  idea  of  the  construction  of  a  clique  sequence  consists 
of  several  parts.  First,  we  construct  a  set  S’  of  ordered  non-overlapping  intervals  from  the  start-  and 
endpoints  of  the  live  ranges  such  that  S’  covers  all  intervals  of  live  ranges.  We  call  this  an  interval  cover  of 
the  live  ranges.  For  each  interval  se  S',  we  determine  the  set  of  live  ranges  that  overlap  with  s,  called 
cliques(s).  This  set  of  live  ranges  must  form  a  clique,  and  we  show  that  each  such  clique  is  a  [r y,f2]  clique, 
where  s  =  [fy,f2].  We  then  show  that  since  the  elements  of  S’  are  ordered,  that  if  a  live  range  sy  starts  at  an 
earlier  basic  block  than  a  live  range  s2  then  clique(s^).  <  .clique(s2)-  Hence  we  have  fc  jnd  a  sequence  of 
cliques  that  covers  all  the  cliques  in  an  interval  graph.  Now  more  formally: 
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Definition  7:  (Interval  cover)  Let  {[a^,] . [<*„,&„]}  be  an  ordered  set  of  intervals,  i.e. 

Vie  [l,...,n-l)  atiaM.  The  ordered  set  of  intervals  {[a\J>\] . [a'mJb'm])  is  called  an  interval 

cover  of  {[alJ>{\,...,[anJt>J)  iff 

1-a’t-  aj 

2.  b*  =  b_ 

m  n 

3.  Vie  { 1 }  b-  alV,-l 

4.  V/e  { 3  ie  such  that  a-=ais/bj=bi 

For  an  example,  we  turn  again  to  Figure  5-1.  The  set  of  live  ranges  consists  of  four  intervals:  [1,8]  =  a, 
[4,15]  =  b,  [6,9]  =  c,  [10,17]  =  d.  The  interval  cover  of  those  four  live  ranges  consists  of  intervals 
{[1,3][4,5][6,8][9,10][11,15][16,17]}.  It  is  easy  to  see  that  for  each  basic  block  that  is  part  of  a  live 
range,  /(h-)=i  occurs  in  exactly  one  of  the  intervals  of  the  interval  cover.  Hence,  the  intervals  of  an  interval 
cover  are  non-overlapping. 

Definition  8:  ( S-overlap  clique  of  an  interval  Given  a  set  S  of  intervals  and  an  interval  [if],  the 
S -overlap  clique  of  [ij]  is  defined  as  [r  6  S  |  [ij]  c  • 

The  overlap  clique  of  interval  [4,5]  in  Figure  5- 1  consists  of  live  ranges  {ajb} . 

Lemma  9:  Given  the  set  S  of  live  ranges  in  a  straight  line  program  with  interval  cover 
{ [a, •&,] . [amJ> J } ,  i <j  implies  that 

(S-overlap  clique  of  [a^b;])  .<.  (S-overlap  clique  of  [aj.bj]) 

Proof:  By  definition  of  interval  cover. 

Lemma  10:  Given  the  set  S  of  live  ranges  in  a  straight  line  program,  let  {[a^] . [a„,bj }  be 

the  interval  cover  of  S  and  let  clique-  be  the  S-overlap  clique  of  interval  [a^].  Then  clique i  is  a 

[a,A]  clique- 

Proof:  By  definition  of  overlap  cliques,  the  live  ranges  that  are  members  of  the  overlap  cliques 
must  contain  the  interval  [atJb^.  Hence,  all  live  ranges  in  that  clique  contain  this  interval,  and  the 
clique  is  thus  a  (a^]  clique. 

Given  an  interval  graph  G,  we  can  construct  a  sequence  of  cliques  by  first  constructing  the  interval  cover 
for  the  nodes  in  G.  The  sequence  of  overlap  cliques  of  the  intervals  in  the  interval  cover  is  a  sequence  of 
cliques,  as  in  Definition  6.  It  remains  to  be  shown  that  this  sequence  of  cliques  is  exhaustive,  i.e.  any 
clique  in  an  interval  graph  is  contained  in  that  sequence  of  cliques. 

Theorem  11:  Let  be  the  interval  cover  of  the  nodes  of  an  interval  graph  G, 

and  let  clique j clique n  be  the  corresponding  sequence  of  overlap  cliques.  For  each  clique  c  in 

G  3  at  least  one  i  e  { l,...,n]  s:  ch  that  c  is  contained  in  clique  ^ 

Proof:  Given  the  members  m1 . mm  of  clique  c,  there  must  be  at  least  one  basic  block  bj  that  is 

contained  in  each  live  range  m^ke  { l,../n],  otherwise  c  would  not  be  a  clique.  By  construction 

of  interval  covers,  there  is  exactly  one  interval  la.Jb^  e  {[uj.hj] . [an  bn]}  such  that  bj  e 

[opJbpl-  By  Lemma  10,  clique p  of  interval  [apJ?^  is  an  [apJb^  clique,  hence  all  live  ranges 
containing  bj  are  in  cliquep  -  q.e.d. 

Theorem  1 1  stales  that  the  cliques  in  an  interval  graph  can  be  ordered  to  form  a  sequence  of  cliques. 
Therefore  it  is  easy  to  determine  the  size  of  the  largest  clique  in  an  interval  graph,  and  thus  k  colorability 
can  be  determined  in  polynomial  time.  We  will  now  show  that  the  node  removal  technique  colors  any 
interval  graph  optimally.  Since  we  know  the  clique  sequence  for  an  interval  graph  denoting  register 
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conflicts,  it  is  easy  to  find  the  largest  clique.  If  the  largest  clique  is  not  larger  than  k,  the  node  removal 
technique  will  produce  an  optimal  coloring  in  linear  time.  This  is  stated  in  the  next  theorem  and  lemma. 

Theorem  12:  Given  an  interval  graph  G  in  which  the  size  of  the  largest  clique  is  k,  there  exist 
at  least  two  nodes  v;  and  v2  in  G  that  have  at  most  k-1  neighbors  in  G. 

Proof:  Let  clique j,... , clique  be  the  clique  sequence  for  G  constructed  from  the  interval  cover 
of  G.  Let  clique f  be  the  first  clique  in  that  sequence  such  that  there  is  a  node  v  that  is  in  clique 
but  not  in  clique  (.  Hence,  for  je  (2  clique j_xc  clique  j.  Because  the  size  of  the  largest 
clique  is  k,  clique  can  be  at  most  of  size  L  Because  v  can  only  be  a  member  of  the  cliques 

[cliquej clique ,},  clique  t  has  "accumulated"  all  clique  members  of  [clique , clique •.;},  v  can 

have  at  most  k-1  neighbors. 

Let  cliquej  .<.  cliquen  be  the  last  clique  to  which  a  new  live  range  w  is  added,  that  is 
V*e  1}  clique t+1c clique *.  Hence,  cliquej  is  the  largest  clique  in  the  subsequence 

clique  j, clique . cliquen,  and  w>  is  in  each  clique  of  that  subsequence.  Hence,  w  can  not  have 

more  than  k-1  neighbors. 

Lemma  13:  Let  G  be  an  interva’  graph  in  which  the  size  of  the  largest  clique  is  k.  Then  the 
node  removal  technique  introduced  in  [Chaitin  81]  will  determine  k  colorability  of  G. 

Proof:  By  Theorem  12,  there  must  be  at  least  two  nodes  v  and  w  in  G  with  less  than  k 
neighbors.  Hence,  both  v  and  w  can  be  removed.  Let  the  graph  G'  derived  from  G  by  removing 
v  and  w.  G’  is  an  interval  graph  in  which  the  largest  clique  has  at  most  k  members,  and  hence 
there  are  at  least  two  more  nodes  in  G'  that  are  removable.  G'  is  guaranteed  to  get  smaller  after 
each  removal  step,  and  eventually  G'  is  the  empty  graph  -  q.e.d. 


We  have  shown  that  the  standard  node  removal  technique  will  determine  a  ^-coloring  for  an  interval 
graph  if  there  exists  one.  In  general,  the  standard  method  fails  to  provide  a  ^-coloring  for  graphs  that  are 
non-interval  graphs.  In  the  course  of  this  chapter  we  will  discuss  the  shape  of  register  conflict  graphs  as 
they  occur  for  complex  programming  constructs  like  loops  and  conditionals.  We  will  see  that  register 
conflict  graphs  for  loops  and  conditionals  can  be  arbitrary  graphs.  Even  if  a  1: -coloring  exists  for  such 
conflict  graphs,  the  standard  node  removal  technique  might  be  unable  to  produce  one.  We  will  see  that 
there  are  cases  in  which  structural  analysis  of  register  conflict  graphs  enables  us  to  find  k-colorings  that  are 
not  detected  by  the  standard  node  removal  technique.  We  first  discuss  aspects  of  register  conflict  graphs 
for  loops. 


5.2.  Straight  line  loops  and  circular  arc  graphs 

We  first  concentrate  on  register  conflict  graphs  for  loops  that  consist  of  loop  entry  entry,  loop  exit  exit 
and  a  loop  body  b  that  consists  of  straight  line  code.  We  will  show  that  the  class  of  register  conflict  graphs 
for  such  simple  loops  contains  arbitrary  circular  arc  graphs,  and  that  such  graphs  are  therefore  NP  hard  to 
color  optimally. 

Figure  5-2  shows  a  loop  that  consists  of  a  sequence  of  definition  and  use  statements.  Each  definition  or 
use  statement  forms  a  basic  block.  The  numbers  of  the  basic  blocks  are  shown  to  the  right  of  each 
statement.  Thus  the  loop  shown  in  Figure  5-2  consists  of  loop  entry  1,  loop  exit  13  and  the  straight  line 
loop  body  {2,3,4,5,6,7,8,9,10,1 1 ,12} .  The  loop  contains  loop-continuous  live  ranges  for  b,  c,  d,  e  and/ and 
of  one  loop-broken  live  range  for  a,  namely  {12,13,1,2,3,4,5,6}.  Each  loop-continuous  live  range  of  the 
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loop  can  be  mapped  to  an  interval  on  the  real  line,  since  the  loop  body  consists  of  straight  line  code.  The 
live  range,  for  b  consists  of  basic  blocks  (8,9,10,1 1 }  and  hence  b  is  mapped  to  the  interval  [8,1 1].  Because 
a’s  live  range  is  a  loop-broken  live  range,  it  consists  of  basic  blocks  that  are  not  contiguous.  By  defimuon 
of  loop-broken  live  ranges,  there  is  a  definition  of  a  whose  next  use  can  only  be  reached  via  the  loop’s 
backarc.  Hence,  a’s  live  range  can  not  be  mapped  to  a  single  interval.  Instead,  a’s  live  range  consists  of  a 
set  of  intervals  -  one  interval  per  contiguous  set  of  basic  blocks.  In  Figure  5-2,  the  live  range  for  a 
({12,13,1,2,3,4,5,6})  is  mapped  to  two  intervals,  [12,13]  and  [1,6]  respectively.  The  mapping  of  the  live 
ranges  in  the  loop  to  intervals  is  shown  to  the  right  of  the  loop  construct  Note  that  there  is  a  dotted  line 
between  the  two  intervals  that  represent  the  live  range  a.  This  means  that  both  intervals  represent  the  same 
live  range.  The  dotted  portion  between  the  two  parts  of  the  live  range  for  a  corresponds  to  the  hole  in  live 
range  a  consisting  of  [7,8,9,10,11]. 


Figure  5-2:  Register  conflict  graph  for  loop 


A  loop  is  usually  executed  many  times,  i.e.  if  the  loop  of  Figure  5-2  were  executed,  the  definition  of  a  in 
block  12  would  be  followed  by  blocks  13,  and  then  1,  hence  the  execution  sequence  of  tne  basic  blocks  that 
form  the  live  range  for  a  is  contiguous.  Multiple  executions  of  a  loop  can  be  expressed  by  a  circle 
consisting  of  the.  set  of  basic  blocks  that  form  a  straight  line  loop.  This  is  depicted  in  Figure  5-3.  The  basic 
blocks  that  form  the  same  loop  as  in  Figure  5-2  are  arranged  along  a  circle.  The  execution  order  is 
clockwise,  hence;  basic  block  13  is  followed  by  basic  block  1  etc.  Given  this  representation  of  a  loop,  the 
live  ranges  of  the  loop  can  be  arranged  around  that  circle.  This  is  shown  in  Figure  5-4.  The  live  ranges  are 
segments  of  the  circle  formed  by  the  basic  blocks  of  the  loop.  Hence,  the  live  range  for  b  consists  of  the 
segment  between  basic  blocks  8  and  11,  and  the  live  range  for  a  is  the  (continuous)  segment  between  12 
and  6. 
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Circular  arc  graph 

Figure  5-4:  Register  conflict  graph  for  loop 

Definition  14:  (Circular  arc  graph)  A  circular  arc  graph  is  a  graph  whose  vertices  can  be 
represented  as  segments  of  a  circle  such  that  two  vertices  are  adjacent  if  and  only  if  the 
corresponding  circle  segments  overlap. 

It  is  easy  to  see  that  register  conflict  graphs  for  loops  can  be  arbitrary  circular  arc  graphs.  Optimal 
coloring  of  an  arbitrary  circular  arc  graph  is  NP  complete  [Garey,  M.R.  and  Johnson,  D.S.  79].  Since  for 
an  arbitrary  circular  arc  graph  a  loop  can  be  constructed  such  that  the  live  ranges  of  the  loop  arranged  on  a 
circle  are  equal  to  the  vertices  of  the  circular  arc  graphs,  register  conflict  graphs  for  arbitrary  loops  are  NP 
hard  to  color. 

In  the  next  paragraphs,  we  discuss  situations  in  which  the  register  conflict  graph  of  a  loop  L  is 
"independent"  of  the  backarc  in  L. 
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53.  Simplifying  loops  for  the  purpose  of  global  register  allocation 

We  have  seen  in  the  previous  section  that  the  presence  of  a  loop-broken  live  range  was  necessary  to 
construct  a  register  conflict  graph  that  is  an  arbitrary  circular  arc  graph.  At  least  one  connected  component 
of  the  live  range  graph  of  a  loop-broken  live  range  must  contain  the  backarc  of  the  loop  by  definition  of 
loop-broken.  Removing  the  backarc  from  a  loop  causes  that  connected  component  to  be  separated  into  two 
unconnected  pieces  -  the  corresponding  register  conflict  graph  changes.  We  now  show  situations  in  which 
the  register  conflict  graph  of  a  loop  L  is  unaltered  when  the  backarc  is  removed  from  L. 


5  J.l.  Removing  the  backarc  in  the  absence  of  loop-broken  live  ranges 

If  all  live  ranges  in  a  loop  L  are  continuous,  no  live  range  can  have  any  holes.  The  register  conflict  graph 
of  the  program  L'  derived  from  L  by  removing  the  backarc  is  identical  to  the  register  conflict  graph  of  L. 
Given  a  loop  L  with  a  loop  body  that  consists  of  complex  programming  constructs,  the  backarc  of  L  can 
still  be  ignored,  as  long  as  all  live  ranges  are  loop-continuous  in  L.  In  other  words,  the  backarc  of  a  loop  L 
can  be  ignored  for  the  purpose  of  global  register  allocation  if  L  does  not  cause  any  holes  in  live  ranges. 
This  is  stated  in  the  next  Lemma. 

Lemma  15:  Given  a  loop  L  with  loop  head  h  and  loop  exit  e  that  does  not  contain  any 
loop-broken  live  ranges  and  in  which  all  continuous  equivalent  live  ranges  have  been  changed 
into  continuous  live  ranges  by  adding  the  appropriate  instructions  described  in  Lemma  11  in 
Chapter  4,  let  V  be  the  flow  graph  derived  from  L  by  removing  the  edge  from  node  e  to  node  h 
(the  backarc).  Then  the  register  conflict  graphs  for  L  and  L"  are  identical. 

Proof:  The  set  of  live  ranges  in  L’  is  equal  to  the  set  of  live  ranges  in  the  original  loop  L. 
Otherwise,  there  would  be  a  live  range  t  =  (bj,...Jbn)  in  the  derived  loop  L'  but  not  in  the  original 
loop  L.  This  is  only  possible  if  {bl„..JbJ}  is  part  of  a  connected  component  in  a  variable  v’s  live 
range  graph  in  L  that  contains  V s  backarc,  and  removing  the  backarc  causes  a  partition  of  that 
connected  component  Let  {bx,...Jbi_xJbj,...Jbn}  be  that  connected  component  in  v’s  live  range 
graph  in  L.  By  definition  of  live  range  graphs,  b:  must  contain  a  definition  of  v  that  is  not 
preceded  by  a  use  of  v  in  bj,  and  removing  L’s  backarc  disconnects  {£],...  and  {bj,...,bn}. 
Then,  there  must  be  a  basic  block  bu  e  [bx,...  that  uses  the  definition  in  bj  such  that  bu  is 
part  of  L  but  not  part  of  any  conditional  or  loop  nested  inside  L.  By  prerequisite  all  broken  live 
ranges  must  contain  a  hole;  if  removing  L’s  backarc  causes  a  separation  of  a  broken  live  range, 
the  hole  of  live  range  {bv...Jb^_xJb^..Jb^\  must  be  linked  to  L  -  a  contradiction  because  by 
prerequisite  all  live  ringes  in  L  are  loop  continuous. 

Because  the  set  of  live  ranges  in  both  L  and  L’  is  identical,  the  set  of  edges  between  the  live 
ranges  must  be  identical.  Hence,  the  conflict  graphs  for  L  andL’  are  identical  -  q.e.d. 

The  example  shown  in  Figure  5-5  illustrates  Lemma  15.  The  flow  graph  labeled  FI  is  a  loop  that  contains 
two  loop-continuous  live  ranges,  one  for  a  and  one  for  b  respectively.  The  flow  graph  labeled  F2  is  derived 
from  FI  by  removing  the  backarc.  The  conflict  graph  for  FI  is  equal  to  the  conflict  graph  for  F2,  and  is 
shown  to  the  right. 

As  a  consequence  of  Lemma  15,  the  register  conflict  graph  for  a  program  that  contains  no  conditionals 
and  in  which  all  live  ranges  are  continuous  is  equivalent  to  the  register  conflict  graph  of  straight  line  code. 


We  have  seen  how  the  absence  of  loop-broken  live  ranges  in  a  loop  L  enables  us  to  simplify  the  flow 
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FI  F2  Conflict  graph  (or  FI  and  F2 


Figure  5-5:  Register  conflict  graph  for  a  loop  with  forward  live  ranges 
graph  by  removing  the  backarc  of  L  for  the  purpose  of  global  register  allocation.  If  there  is  a  loop-broken 
live  range  in  a  loop  L,  removing  L's  backarc  in  general  results  in  a  different  register  conflict  graph, 
depicted  in  Figure  5-6.  The  loop  depicted  to  the  left  has  a  loop-broken  live  range  Fra  consisting  of  basic 


Figure  5-6:  Removing  die  backarc  in  the  presence  of  loop-broken  live  ranges 
blocks  (1,4).  The  register  conflict  graph  for  that  loop  is  depicted  below.  Removing  the  backarc  partitions 
the  live  range  for  a  into  two  separate  live  ranges,  { 1 )  and  (4)  respectively.  This  leads  to  an  extra  node  for 
a  in  the  register  conflict  graph  -  one  per  live  range.  The  register  conflict  graph  for  the  flow  graph  to  the 
right  is  again  shown  below  and  differs  from  the  register  conflict  graph  of  the  original  loop.  In  the  next 
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section,  we  discuss  restrictions  that  must  be  met  if  the  register  conflict  graph  of  a  loop  L  is  independent  of 
L’s  backarc  in  the  presence  of  loop-broken  live  ranges. 


S32.  Removing  the  backarc  in  the  presence  of  loop-broken  live  ranges 


Given  a  loop  L  with  a  loop-broken  live  range  v,  v  must  contain  a  hole  that  is  entirely  contained  in  L.  Note 
that  by  definition  of  loop  broken,  the  hole  can  not  be  entirely  contained  in  any  programming  construct 
nested  inside  L.  We  can  therefore  partition  a  loop-broken  live  range  into  two  parts.  The  first  part  of  the 
live  range  consists  of  basic  blocks  in  the  top  part  of  the  loop  and  contains  the  loop  entry,  and  the  second 
part  of  the  live  range  consists  of  basic  blocks  in  the  bottom  part  of  the  loop  and  contains  the  loop  exit  In 
the  following,  we  call  those  parts  TOP  and  BOT.  Before  we  give  a  formal  definition  of  the  TOP  and  BOT 
parts  of  a  loop-broken  live  range  we  demonstrate  the  concept  with  a  few  examples. 


Figure  5-7  depicts  a  sample  loop  with  one  loop-broken  live  range  for  variable  a,  consisting  of  { 1,2, 7,8} 
The  live  range  has  one  hole  consisting  of  {3,44,6}.  The  live  range  is  partitioned  into  the  top  part 


Figure  5-7:  TOP  and  BOT  part  of  a  loop-broken  live  range 

TOP=  { 1,2}  containing  the  loop  entry  and  the  bottom  part  BOT  =  {7,8}  containing  the  loop  exit  A  more 
complex  example  is  shown  in  Figure  5-8.  A  complex  loop  is  shown,  consisting  of  loop  head  BO,  loop  exit 
B5  and  a  loop  body  that  consists  of  a  conditional.  In  that  loop,  a  is  a  loop-broken  live  range;  the  BOT  part 
of  a  consists  of  [B2JB4JB5]  and  the  TOP  part  of  a  consists  of  {BO}.  Note  that  the  definitions  of  a  in  basic 
blocks  B2  and  B4  occur  in  different  branch  clauses.  Therefore  the  basic  blocks  that  form  the  BOT  part  of 
the  live  range  for  a  do  not  form  straight  line  code.  The  BOT  part  of  the  live  range  for  a  consists  of  the 
connected  component  of  the  live  range  graph  for  a  that  contains  the  loop  exit  We  are  now  ready  to  define 
the  TOP  and  BOT  part  of  a  loop-broken  live  range  formally. 
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B3 


B4 


Figure  5-8:  TOP  and  BOT  part  of  a  loop-broken  live  range  in  a 

complex  loop 

Definition  16:  (TOP  and  Bi  f  set  of  a  loop-broken  live  range)  Given  a  loop-broken  live  range 
for  a  variable  v  of  a  loop  L  with  loop  head  h  and  loop  exit  e  and  the  live  range  graph  for  v, 
TOPlyJL )  consists  of  the  basic  blocks  that  form  the  connected  component  of  v’s  live  range  graph 
that  contains  h.  BOTlyl)  consists  of  the  basic  blocks  that  form  the  connected  component  of  the 
live  range  graph  that  contains  e. 

In  the  next  paragraphs  we  will  discuss  situations  that  permit  to  eliminate  the  basic  blocks  that  form  the 
TOP  part  or  the  BOT  part  from  a  loop  broken  live  range  without  changing  the  register  conflict  graph. 

Definition  17:  (Definition  starting  the  BOT  part  of  a  loop-broken  live  range)  Given  a  loop- 
broken  live  range  for  a  variable  x  in  a  loop  L,  we  say  that  a  definition  of  x  in  basic  block  bz  starts 
BOT(xJS)  iff  bxe  BOT(xfi). 

Example:  in  the  loop  depicted  in  Figure  5-8,  basic  blocks  B2  and  B4  both  contain  a  definition  of  a  and  are 
members  of  the  set  BOT(aJS).  Hence,  both  definitions  start  the  BOT  part  of  the  live  range  for  a. 

Given  a  loop  L  ;n  which  no  loop-continuous  live  range  contains  a  basic  block  that  occurs  in  BOT(yJL)  of  a 
loop-broken  live  range  y,  it  is  easy  to  see  that  in  the  register  conflict  graph  the  "bottom  part”  of  y  can  be 
ignored  in  the  loop,  illustrated  in  figure  5-9.  In  the  loop  depicted  to  the  left,  loop-broken  live  range  a 
consists  of  TOP  part  (1,2)  and  BOT  part  {6,7 } .  Neither  loop-continuous  live  range  b  nor  c  overlap  with  a 
in  hasic  blocks  7  or  8.  The  BOT  part  consisting  of  basic  blocks  7  and  8  can  be  eliminated  by  removing  a’s 
definition  in  basic  block  7  and  the  subsequent  uses  in  the  loop  that  are  reachable  on  paths  that  do  not 
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Figure  5-9:  Removing  the  bottom  part  of  a  loop-broken  live  range 
contain  the  loop’s  backarc.  This  is  depicted  to  the  right.  For  both  loops  the  register  conflict  graphs  are 
shown  below  -  they  are  identical. 

Note  that  the  removal  of  the  definition  that  starts  the  BOT  part  of  o’s  live  range  does  not  alter  the  register 
conflict  graph  even  if  there  is  a  loop-continuous  live  range  that  contains  basic  blocks  of  both  the  TOP  part 
and  the  BOT  part  of  a.  This  is  demonstrated  in  Figure  5-10.  loop-continuous  live  range  b  contains  basic 
blocks  from  both  the  TOP  part  and  the  BOT  part  of  a.  Still  the  register  conflict  graph  does  not  change 
when  the  definition  of  a  in  basic  block  7  is  removed. 

Given  a  loop  L,  removing  definitions  and  uses  that  form  the  BOT  part  of  a  loop-broken  live  range  /  is 
equivalent  to  eliminating  the  entire  BOT  part  of  l.  If  we  enforce  that  every  live  range  that  overlaps  with  the 
TOP{lJS)  of  a  loop-broken  live  range  l  also  overlaps  with  BOTQT),  then  the  entire  BOT  part  of  /  can  be 
eliminated  from  /  without  altering  the  register  conflict  graph.  These  observations  are  formalized  in  the  next 
lemma 

Lemma  18:  Given  a  flow  graph  F  with  register  conflict  graph  G  that  contains  a  loop  L 
consisting  of  basic  blocks  [h^,,„Jbn)  with  loop-broken  live  ranges  ln,  let  l’  be  derived  from 
by  removing  all  basic  blocks  that  are  in  BOTQjJJ)  V  ye  {l,...,m},  and  let  G'  be  derived  from  G 
by  exchanging  Ij  for  Ijti j e  (l,...,m).  If  every  live  range  that  contains  a  basic  block  of  the  BOT 
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Figure  5-10:  loop-continuous  live  range  overlapping  with  both  the  BOT  and  TOP 

set  of  a  loop-broken  live  range 

set  of  a  loop-broken  live  range  aLo  ;»n  tains  at  least  one  basic  block  of  the  TOP  set  of  the  same 
live  range,  G  and  G'  are  identical. 

Proof:  The  nodes  in  G  consist  c»f  nodes  representing  loop-broken  live  ranges  and  of  nodes 
representing  loop-continuous  live  ranges.  Every  loop-continuous  live  range /  that  co-exists  with 
a  loop-broken  live  range  l  contains  either  just  basic  blocks  that  occur  in  members  of  TOP(IJL)  or 
both  basic  blocks  that  occur  in  members  of  TOP(IJ^)  and  basic  blocks  that  occur  in  members  of 
BOT(IJ^).  Therefore  the  node  representing/ exists  both  in  G  and  G’.  Further,  the  set  of  edges 
incident  to  the  node  for /  must  be  identical  in  both  G  and  G'  because  if/co-exists  with  /  in  a  basic 
block  occurring  in  a  member  of  BOT(lJ. ,),  it  must  also  co-exist  with  /  in  a  basic  block  occurring 
in  a  member  of  TOP(IL)  and  therefore  there  must  be  an  edge  between /  and  /  in  G'.  Since  all 
other  live  ranges  of  L  are  unaltered,  G  and  G’  are  equal  -  q.e.d. 


Note  that  Lemma  18  also  holds  if  the  terms  BOT  and  TOP  are  exchanged.  This  is  illustrated  in  Figure 
5-11.  The  live  ranges  of  a  loop  are  depicted  to  the  left;  a  is  a  loopbroken  live  range  with  TOP  part  {1,2} 
and  BOT  part  {7,8}.  Live  range  b  is  the  only  loop-continuous  live  range  adjacent  to  a,  and  it  shares  with 
the  live  range  for  a  only  basic  blocks  that  are  in  the  BOT  part  of  a.  Removing  the  uses  of  a  in  basic  blocks 
1  and  2  eliminates  the  TOP  part  of  a  -  the  register  conflict  graph  for  both  loops  is  identical  and  shown  at  the 
bottom  of  the  figure. 

Lemma  19:  Given  a  flow  graph  F  with  register  conflict  graph  G  that  contains  a  loop  L 

consisting  of  basic  blocks  (6j  ,...,£„}  with  loop-broken  live  ranges  lj . /„,  let  lj  be  derived  from  lj 

by  removing  all  basic  blocks  that  are  in  TOPQjJL,)  V  je  (l,...,m},  and  let  G'  be  derived  from  G 
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Figure  5-11:  Removing  the  TOP  part  of  a  loop-broken  live  range 

by  exchanging  Ij  for  hVye  { l,...,m}.  If  every  live  range  that  contains  a  basic  block  of  the  TOP 
set  of  a  loop-broken  bve  range  also  contains  at  least  one  basic  block  of  the  BOT  set  of  the  same 
live  range,  G  and  G‘  are  identical. 

Proof:  The  nodes  in  G  consist  of  nodes  representing  loop-broken  live  ranges  and  of  nodes 
representing  loop-continuous  live  ranges.  Every  loop-continuous  live  range/ that  co-exists  with 
a  loop-broken  live  range  l  contains  either  just  basic  blocks  that  occur  in  members  of  BOT(I  L)  or 
both  basic  blocks  that  occur  in  members  of  BOT(IT)  and  basic  blocks  that  occur  in  members  of 
TOP (IJ*).  Therefore  the  node  representing/  exists  both  in  G  and  G'.  Further,  the  set  of  edges 
incident  to  the  node  for/must  be  identical  in  both  G  and  G'  because  if/co-exists  with  /  in  a  basic 
block  occurring  in  a  member  of  TOP(IJL),  it  must  also  co-exist  with  l  in  a  basic  block  occurring 
in  a  member  of  BOT(IJL)  and  therefore  there  must  be  an  edge  between /and  l  in  G'.  Since  all 
other  live  ranges  of  L  are  unaltered,  G  and  G'  are  equal  -  q.e.d. 


Note  that  the  removal  of  basic  blocks  described  in  Lemmas  18  or  19  turn  loop-broken  live  ranges  into 
loop-continuous  live  ranges.  If  all  loop-broken  live  ranges  can  be  changed  to  loop-continuous  live  ranges, 
the  backarc  of  the  loop  can  be  removed  without  changing  the  register  conflict  graph  of  the  loop  (Lemma 
15).  By  definition,  both  the  TOP  and  the  BOT  part  of  a  loop  broken  live  range  l  m  a  loop  L  can  contain 
basic  blocks  outside  the  loop  L.  Because  we  enforce  that  ever y  live  range  that  overlaps  with  TOP(lJS)  also 
overlaps  with  BOT(lJL ),  we  can  remove  the  loop  backarc  without  changing  the  register  conflict  graph  of  the 
entire  program. 
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All  that  was  discussed  in  section  5.3.2  can  be  applied  to  conditional-broken  live  ranges:  if  all  live  ranges 
that  overlap  with  a  conditional-broken  live  range  b  overlap  with  both  the  TOP  and  the  BOT  set  of  b,  the 
overall  register  conflict  graph  does  not  change  when  the  TOP  or  the  BOT  part  of  b  are  removed.  Hence,  the 
conditional-broken  live  range  b  has  been  changed  into  a  conditional-continuous  live  range. 

We  have  seen  that  in  the  absence  of  loop-broken  live  ranges,  the  backarc  of  a  loop  can  be  removed 
without  altering  the  register  conflict  graph.  In  other  words,  the  flow  graph  that  previously  contained  a  loop 
has  been  "straightened  out"  and  now1  no  longer  contains  that  loop.  The  simplification  of  loops  depends 
merely  on  the  absence  of  loop-broken  live  ranges.  For  conditionals,  the  absence  of  broken  live  ranges  is 
not  enough  to  ensure  a  register  conflict  graph  that  is  easy  to  color  optimally.  We  will  see  that  even  if  every 
live  range  in  a  conditional  is  continuous,  register  conflict  graphs  for  conditionals  can  be  arbitrary  graphs 
that  are  hard  to  color  optimally. 

5.4.  Register  conflict  graphs  for  conditionals 

Throughout  this  section  we  assume  that  no  loop  that  is  part  of  a  branch  clause  contains  loop-broken  live 
ranges.  Examining  register  conflict  graphs  for  conditionals  is  more  complex  than  for  loops  and  straight  line 
code,  because  there  is  a  larger  variety  of  live  range  types,  and  there  are  alternative  paths  through  the 
branch. 

Definition  20:  (Path  through  a  conditional  branch)  A  path  through  a  conditional  branch  with 
split  node  s  and  join  node  j  is  a  sequence  of  basic  blocks  [bv...J?n]  such  that  b,=s. 

Vie  {l,...,n),  b,-  occurs  in  a  branch  clause  and  Vie  {2,...^i}<bM^J>  is  an  edge  in  the  flow  graph 
denoting  the  branch. 

Examples  for  paths  through  the  conditional  branch  depicted  in  Figure  5-12  are  the  sequences 
{B0-»Bl-»B3-*fl4->£8}  and  {B0-»fl5-4ff6-»fl7-»fl8},etc. 

Each  path  through  a  conditional  branch  consists  of  straight  line  code.  While  the  register  conflict  graph 
for  one  individual  path  can  be  colored  optimally  in  isolation,  it  is  in  general  not  possible  to  combine 
optimal  colorings  for  individual  paths  to  an  optimal  coloring  for  the  entire  conditional  statement  in 
polynomial  time.  An  exception  are  conditional  branches  in  which  each  live  range  is  BLOCAL,  discussed 
next. 

5.4.1.  Conditionals  in  which  all  live  ranges  are  BLOCAL 

Intuitively,  the  number  of  BGLOBAL ,  LOEN  and  LOEX  live  ranges  in  a  conditional  is  a  measure  for  the 
degree  of  dependency  between  the  register  conflict  graphs  of  the  individual  branch  clauses.  The  reason  is 
that  BGLOBAL ,  LOEN  and  LOEX  live  ranges  might  be  shared  between  the  register  conflict  graphs  of 
separate  branch  clauses.  If  each  live  range  in  a  conditional  is  BLOCAL ,  the  register  conflict  graphs  for 
individual  branch  clauses  are  completely  independent  Therefore,  conditionals  in  which  all  live  ranges  are 
BLOCAL  can  be  colored  optimally  by  combining  individual  colorings  for  each  branch  clause.  This  is 
formalized  in  the  next  theorem. 
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Figure  5-12:  Paths  through  a  conditional 

Theorem  21:  Given  a  register  conflict  graph  G  of  a  conditional  branch  with  split  node  s  and 

join  node  j,  and  a  set  of  branch  clauses  cy . c„,  the  chromatic  number  of  G  is  equal  to  the 

maximal  chromatic  number  of  Grje  {l,...,n}  if  all  live  ranges  that  occur  in  the  conditional 

i 

branch  are  BLOCAL. 

Proof:  The  conflict  graph  G  for  the  entire  conditional  can  be  partitioned  into  the  conflict  graph 
for  the  split  node  GJt  the  conflict  graph  for  the  join  node  Gj,  and  a  conflict  graph  for  each  branch 
clause  Gr-  Both  Gr  and  G,  must  be  empty,  because  the  conditional  branch  contains  only 

“  J 

BLOCAL  live  ranges  and  no  BLOCAL  live  range  can  contain  s  or  j.  For  the  same  reason,  there 
can  be  no  edge  between  the  conflict  graphs  for  the  individual  branch  clauses.  Hence, 
G=u(Gc,i€  {l,...,n})  and  the  chromatic  number  of  G  is  equal  to  the  maximal  chromatic  number 

of  the  Gc  -  q.e.d. 

Note  that  Theorem  21  holds  for  nested  conditionals,  even  if  the  inner  conditionals  do  contain  live  ranges 
that  are  not  BLOCAL.  As  a  consequence  of  Theorem  21,  the  register  conflict  graph  of  a  conditional  C  with 

split  node  5,  join  node  J,  and  branch  clauses  cl . cn  in  which  all  live  ranges  are  BLOCAL  is  equivalent  to 

the  register  conflict  graph  of  the  flow  graph  derived  from  C  by  "linearizing"  the  branch  clauses.  Because 
the  register  conflict  graphs  for  the  clauses  are  completely  unrelated,  the  order  in  which  the  branch  clauses 
are  linearized  in  the  flow  graph  is  irrelevant. 


5.4.2.  Conditionals  and  BGLOBAL  live  ranges 

The  presence  of  BGLOBAL  live  ranges  in  a  conditional  complicates  things,  because  BGLOBAL  live 
ranges  can  be  conditional-broken  live  ranges.  A  BGLOBAL  live  range  for  a  variable  v  is  conditional- 
broken  if  there  is  a  re-definition  of  v  in  some  branch  clause.  An  example  of  a  continuous  and  a  broken 
BGLOBAL  live  range  is  given  in  Figure  5-13.  The  live  range  for  variable  a  depicted  to  the  left  consists  of 


65 


Continuous  BGLOBAL  range  Broken  BGLOBAL  range 


Figure  5-13:  Continuous  and  broken  BGLOBAL  live  ranges 
basic  blocks  [BL)£\J32fi'ifi4fl5  £6,57,58},  that  is  all  basic  blocks  of  the  conditional  branch.  Hence,  the 
live  range  is  a  continuous  BGLOBAL  live  range.  The  live  range  for  a  depicted  to  the  right  in  figure  5-13 
consists  of  basic  blocks  {50,51,52,53.54,55,67,68}.  B6  is  not  part  of  the  live  range  for  a,  even  though 
there  exists  a  path  through  the  branch  that  contains  B6.  Hence,  it  is  a  broken  BGLOBAL  live  range. 

One  more  reason  why  BGLOBAL  live  ranges  can  make  the  register  conflict  graph  of  a  conditional  more 
complex  is  because  they  introduce  dependencies  between  the  register  conflict  graphs  of  individual  branch 
clauses.  In  the  next  paragraphs  we  will  see  how  the  continuity  of  BGLOBAL  live  ranges  determines  how 
hard  it  is  to  color  register  conflict  graphs  for  conditionals  that  contain  BLOCAL  and  BGLOBAL  live  ranges. 

5.4 2.1.  Continuous  BGLOBAL  live  ranges  in  conflict  graphs  for  conditionals 

A  continuous  BGLOBAL  live  range  for  a  variable  v  consists  of  all  basic  blocks  that  are  on  a  path  from  the 
split  node  to  the  join  node  of  a  branch.  Therefore,  a  continuous  BGLOBAL  live  range  overlaps  with  every 
other  live  range  in  the  conditional.  In  the  register  conflict  graph  for  a  conditional  branch,  a  node  that 
represents  a  continuous  BGLOBAL  live  range  is  therefore  adjacent  to  all  other  nodes. 

Lemma  22:  Given  the  register  conflict  graph  G  of  a  conditional  that  contains  n  continuous 
BGLOBAL  live  ranges,  let  G'  be  the  graph  derived  from  G  by  removing  all  n  nodes  that  represent 
continuous  BGLOBAL  live  ranges.  The  chromatic  number  of  G  is  equal  to  the  chromatic  number 
of  G'  +n. 

Proof:  Let  c'  be  the  chromatic  number  of  G\  and  let  each  {cj,...,cm}  be  the  set  of  colors  used 
for  the  nodes  in  G\  Each  continuous  BGLOBAL  live  range  is  adjacent  to  each  vertex  in  G’  and 
hence  must  be  colored  with  a  color  ct  (cj,...,cm).  Since  the  number  of  BGLOBAL  live  ranges  is 
n,  the  chromatic  number  of  G  is  therefore  c'  +n. 
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As  a  consequence  of  Lemma  22  continuous  BGLOBAL  live  ranges  can  be  "ignored"  during  the  coloring 
process,  because  they  require  each  a  new  color,  regardless  of  the  specific  coloring  found  for  the  other  nodes 
in  the  conflict  graph. 

If  all  BGLOBAL  live  ranges  are  conditional-continuous,  the  branch  clauses  of  a  conditional  can  be 
linearized  for  the  purpose  of  register  allocation  if  the  conditional  does  not  contain  any  LOEN  or  LOEX  live 
ranges.  The  reason  is  that  all  holes  in  BGLOBAL  live  ranges  are  contained  inside  an  inner  loop  or 
conditional  and  thus  inside  a  branch  clause.  Note  that  conditional  continuity  is  a  weaker  condition  than 
continuity. 

Broken  BGLOBAL  live  ranges  can  make  register  conflict  graphs  for  conditionals  arbitrarily  hard  to  colon 
like  loop-broken  live  ranges,  conditional-broken  live  ranges  can  cause  arbitrary  circular  arc  register  conflict 
graphs,  discussed  in  the  next  paragraphs. 

5A2.2.  Conditionals  and  broken  BGLOBAL  live  ranges 

We  show  now  that  conditionals  that  consist  of  B LOCAL  and  arbitrary  BGLOBAL  live  ranges  produce 
register  conflict  graphs  that  include  arbitrary  circular  arc  graphs,  and  are  therefore  NP  hard  to  color. 

Before  we  give  a  formal  theorem,  we  go  through  an  example  that  illustrates  the  point.  Figure  5-14  shows 
a  conditional  that  contains  a  broken  BGLOBAL  live  range  for  a. 

The  conditional  consists  of  two  branch  clauses,  one  consisting  of  the  single  block  14,  the  other  consisting 
of  a  straight  line  segment  {1,2,3,4,5,6,7,8,9,10,11,12,13}.  The  conditional  contains  one  broken  BGLOBAL 
live  range  for  variable  a,  consisting  of  basic  blocks  {0,1,2, 3,' 4,5, 6, 12, 13, 14, 15}.  The  live  range  for  a  is 
broken,  because  it  does  not  contain  all  nodes  along  the  path  through  the  branch  clause  that  starts  with  block 
1. 


Because  each  path  through  a  conditional  consists  of  straight  line  code,  the  live  ranges  of  path 
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,15}  can  be  mapped  to  intervals  on  the  real  line  like  live  ranges  of  straight 
line  code.  This  is  depicted  to  the  right  in  Figure  5-14.  The  live  range  for  a  is  mapped  to  two  intervals, 
(0,6]  and  (12,15],  because  a  is  not  live  in  all  basic  blocks  in  the  path.  Because  a’s  live  range  is  BGLOBAL, 
the  intervals  [0,6]  and  [12,15]  must  be  colored  with  the  same  color,  indicated  by  a  dotted  arc  between  them. 
Note  that  this  register  conflict  graph  is  equivalent  to  that  depicted  for  a  loop  in  Figure  5-2.  Hence,  the  class 
of  register  conflict  graphs  for  conditionals  contains  circular  arc  graphs.  In  the  next  theorem  we  show  that 
for  an  arbitrary  circular  arc  graph,  there  is  a  conditional  branch  whose  register  conflict  graph  is  equal  to  the 
circular  arc  graph. 

Theorem  23:  Given  an  arbitrary  circular  arc  graph  C,  there  exists  a  conditional  branch  B  that 
contains  only  BLOCAL  and  broken  BGLOBAL  live  ranges  such  that  the  register  conflict  graph  for 
B  is  equal  to  C. 

Proof:  Let  {[altt;],....,[a„,ft„]}  be  the  segments  of  a  circle.  Suppose  the  circle  is  labeled  with 
the  endpoints  of  the  segments  {c,,...^m}  such  that  Vie  {  1,..,/ti}3/'€  {1, ...,/!} such 
that c, = OjOrc  —  bj and Vie  {l,...,m~l}cJ<ci+1. 
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Figure  5-14:  Mapping  live  ranges  of  one  path  to  intervals 


We  construct  a  conditional  with  split  node  c0,  join  node  cm+1  and  two  branch  clauses.  The 
newly  introduced  c0  and  cm+]  are  chosen  such  that  c0  <  c}  and  cm+1  >  cm. 

The  first  branch  clause  is  empty,  i.e.  there  is  an  edge  from  the  split  node  to  the  join  node.  The 
second  branch  cla”se  consists  of  the  straight  line  code  segment  {cj,...,cm} . 


Each  circle  segment  in  {[aj.ij],...., [<»„,£,,])  is  mapped  to  a  set  of  intervals  on  the  real  line  as 
follows: 

AKb,])=f  KA1  ifflM 

L  K-^jUcoA]  if a^bi 

The  live  ranges  in  our  constructed  branch  are  the  intervals  /([a If  ai  >  bt,  the  circle 
segment  [a(>b(]  is  mapped  to  two  imerv;  ’.s,  and  both  intervals  are  part  of  the  same  live  range. 
Because  the  split  node  c0  and  the  join  node  cm+]  are  part  cf  that  live  range,  it  must  be  a 
B GLOBAL  live  range.  It  is  easy  to  see  that  the  register  conflict  graph  for  this  branch  is  equal  to 
the  original  circular  arc  graph  -  q.c.d. 

The  proof  of  23  is  best  understood  by  an  example.  Figure  5-15  depicts  a  circular  arc  graph,  with  segments 
o=[l,4]>=[3^],c=[6,10]/f=[7,8]  ande=[9,2j.  We  construct  a  conditional  B  such  that  the  register  conflict 
graph  for  B  is  equal  to  the  circular  arc  graph  in  Figure  5-15. 


The  circle  is  labeled  {1,2,3,4,5,6,7,8,9,10},  i.e.  Cj  -  1  and  cm  =  10.  We  choose  c0=0  and  cm+1  =  1 1  The 
results  of  the  function / defined  above  are  as  follows: 
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Figure  5-15:  An  arbitrary  circular  arc  graph 


(a) 

/Ul,4])=[l,4] 

( b ) 

A[3,5])=[331 

(c) 

A[6,10])  =  [6,10] 

(d) 

A[7,8])=[7,8] 

(e) 

A[93])=(f9,ll],[03]) 

We  construct  a  conditional  that  consists  of  branch  split  node  c0=0,  join  node  cm+1  =  11  and  one  empty 
branch  clause,  and  a  straight  line  branch  clause  consisting  of  basic  blocks  { 1,23,4,5,6,7,8,9,10).  The  live 
range  for  e  is  BGLOBAL ,  hence  e  is  live  at  the  split  node  and  at  the  join  node.  All  other  live  ranges  are 
BLOCAL.  Figure  5-16  shows  the  conditional  whose  register  conflict  graph  is  the  circular  arc  graph 
depicted  in  Figure  5-15. 

Note  that  e  is  a  broken  BGLOBAL  live  range  -  by  Lemma  22  and  Theorem  21  register  conflict  graphs  of 
conditionals  that  contain  just  BLOCAL  live  ranges  and  continuous  BGLOBAL  live  ranges  can  be  colored 
optimally  in  polynomial  time  and  hence  can  not  be  arbitrary  circular  arc  graphs. 

In  the  presence  of  LOEN  and  LOEX  live  ranges,  reg.ster  conflict  graphs  for  conditionals  can  become  very 
complicated.  We  will  see  that  even  if  all  live  ranges  in  a  conditional  are  continuous,  the  presence  of  LOEN 
and  LOEX  live  ranges  can  cause  an  arbitrary  register  conflict  graph.  There  is  no  easy  solution  to  why 
LOEN  and  LOEX  live  ranges  cause  arbitrary  graphs  -  we  therefore  restrict  ourselves  to  discussing  situations 
in  which  LOEN  and  LOEX  live  ranges  do  not  cause  arbitrary  register  conflict  graphs.  For  the  remainder  of 
this  chapter,  we  assume  that  all  live  ranges  in  a  conditional  are  continuous,  so  that  we  can  concentrate  on 
the  influence  of  LOEN  and  LOEX  live  ranges. 
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Figure  5-16:  Branch  constructed  from  an  arbitrary  circular  arc  graph 

5.4 3.  Conditionals  and  r,QENHGEX  live  ranges 

By  definition  of  LOEN  live  ranges,  all  WEN  live  ranges  of  a  conditional  contain  the  split  node  and 
therefore  co-exist  d  iring  program  execution.  At  the  same  time,  all  WEX  live  ranges  contain  the  join  node. 
Hence,  i>l  Wf  A  live  -arves  must  reside  in  distinct  registers  and  all  WEX  live  ranges  must  reside  in 
■iietiinU  registers. 

It  is  easy  to  see  that  if  all  WEN  live  ranges  overlap  with  all  WEX  live  ranges,  the  corresponding  nodes 
must  form  a  clique  in  the  register  conflict  graph.  By  definition,  WEN  and  WEX  live  ranges  must  all 
overlap  with  BGWBAL  live  ranges.  If  in  addition  a  conditional  C  contains  no  BWCAL  live  ranges,  each 
WEN  and  WEX  live  range  can  be  changed  into  a  continuous  BGWBAL  live  range  without  changing  the 
register  conflict  graph.  This  is  done  as  follows:  all  WEN  and  WEX  live  ranges  are  "padded"  until  they 
contain  all  basic  blocks  that  form  C.  Thus,  the  register  conflict  graphs  for  the  individual  branch  clauses 
share  all  nodes  -  the  register  conflict  graph  of  an  arbitrary  branch  clause  is  equal  to  the  register  conflict 
graph  of  the  entire  conditional.  Therefore,  all  but  one  branch  clause  can  be  eliminated  from  the  flow  graph 
for  the  purpose  of  register  allocation  -  the  conditional  can  be  "collapsed"  into  one  arbitrary  branch  clause. 

If  the  WEN  and  LOEX  live  ranges  of  a  conditional  do  not  form  a  clique  in  the  register  conflict  graph, 
colors  used  for  a  LOEN  live  range  a  can  be  re-used  for  a  WEX  live  range  b  as  long  as  there  is  no  basic 
block  in  which  a  and  b  co-exist. 


70 


Definition  24:  (Overlap  set  of  a  LOEN  node)  Given  a  conditional  C,  the  overlap  set  of  LOEN 
live  range  a  in  C  is  the  set  of  LOEX  live  ranges*; x2<-Xn  in  C  such  that  there  is  at  least  one  basic 
block  in  which  a  and  x;,  ie  {l,...,n}  co-exist. 

All  LOEX  live  ranges  m  a  live  range  a’s  overlap  set  must  be  colored  with  colors  that  differ  from  a.  At  the 
same  time  LOEN  colors  can  be  be  re-used  for  LOEX  live  ranges  not  in  the  overlap  set. 

Figure  5-17  shows  the  register  conflict  graph  of  a  conditional  that  consists  only  of  LOEN  and  LOEX  live 
ranges.  In  that  example,  a,b,c  and  d  are  LOEN  live  ranges.  The  set  of  LOEX  live  ranges  consists  of  v,w >pc,y 


LOEN 


LOEX 


Figure  5-17:  Register  conflict  graph  consisting  of  LOEN/LOEX  live  ranges 
and  z.  Live  range  a  overlaps  with  LOEX  live  ranges  wjt  and  z;  b  overlaps  with  y  and  v  etc.  By  definition  of 
LOEN,  the  LOEN  live  ranges  form  a  clique;  the  same  is  true  for  the  LOEX  live  ranges.  If  a  maximal 
number  of  colors  used  to  color  the  LOEN  clique  can  be  r;-used  to  color  the  LOEX  clique,  the  coloring  for 
that  conflict  graph  is  optimal.  The  problem  of  re-using  the  maximal  number  of  colors  in  the  LOEX  clique 
can  be  mapped  to  the  maximal  matching  problem  for  a  bipartite  graph. 

Definition  25:  (Matching  in  a  graph)  A  matching  in  a  graph  G  is  an  independent  subset  of  its 
edges,  such  that  no  two  of  the  edges  are  adjacenL  A  maximal  matching  of  G  is  the  largest 
possible  independent  set  of  edges. 

We  construct  a  bipartite  graph  from  the  set  of  LOEN  live  ranges  and  LOEX  live  ranges  as  follows: 

1.  each  live  range  corresponds  to  one  unique  node, 

2.  there  is  an  edge  between  a  LOEN  live  range  a  and  each  LOEX  live  range  that  is  not  in  a’s 
overlap  set 

Figure  5-18  depicts  the  bipartite  graph  derived  from  the  register  conflict  graph  shown  in  Figure  5-17. 

LOEN  live  range  a  overlaps  with  LOEX  live  ranges  wpcj-  Therefore  the  LOEX  nodes  adjacent  to  a  are 
v,y.  It  is  easily  seen  that  the  graph  is  bipartite.  A  maximal  matching  of  the  LOEN  nodes  with  LOEX  live 
ranges  can  be  used  to  re-use  colors  of  LOEN  life  ranges  for  LOEX  live  ranges  safely.  Finding  an  optimal 
matching  for  the  nodes  of  a  bipartite  graph  can  be  done  in  polynomial  time.  An  algorithm  for  optimal 
bipartite  matching  can  be  found  in  [Smith  87]. 

The  maximal  matching  of  the  derived  bipartite  graph  is  used  to  derive  a  coloring  for  the  conflict  graph  as 
follows:  An  arbitrary  color  is  chosen  for  each  LOEN  live  range  and  for  each  unmatched  LOEX  live  range 
Matched  nodes  receive  the  color  of  their  "matching"-partner.  The  detailed  algorithm  is  given  in  Figure 
5-19. 
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Figure  5-18:  Bipartite  graph  consisting  of  WEN  and  LOEX  live  ranges 


Input:  The  LOEN  and  LOEX  and  BGLOBAL  live  ranges  of  a  non  nested  branch 

Output:  An  optimal  coloring  for  those  live  ranges 

Method: 


for  all  LOEN  live  ranges  l  do 

datarmlna  tha  ova r lap  sat; 
color  ascii  I  with  s  distinct  color; 
od 

G  . «  the  empty  graph; 

for  all  LOEN  live  ranges  I  do 

for  all  LOEX  live  ranges  x  e  overlap{I)  do 

add  x  and  I  G'r  vertices  if  not  yet  in  it; 
add  edge  <^x>  to  G'j  edge  set; 
od 
od 

bipartite  matching(G); 

for  all  LOEX  live  ranges  x  do 

if  x  has  been  matched 

then  color  x  with  tha  mat  chad  nodas  color 
etoe  color  x  with  a  naw  color; 
od 


Figure  5-19:  Coloring  a  register  conflict  graph  for  a  conditional  that  contains 

only  LOEN  and  LOEX  live  ranges 

The  restriction  on  the  live  ranges  in  a  conditional  that  allows  an  optimal  coloring  can  be  relaxed  by 
allowing  arbitrary  BGLOBAL  live  ranges  in  addition  to  WEN  and  WEX  live  ranges.  The  idea  behind  this 
relaxation  is  that  BGLOBAL  live  ranges  by  definition  contain  both  the  split  node  and  the  join  node, 
regardless  of  whether  they  are  continuous  or  broken.  Hence,  every  BGWBAL  live  range  co-exists  with 
every  WEN  live  range  in  the  split  node,  and  with  every  WEX  live  range  in  the  join  node.  Therefore,  the 
registers  assigned  to  BGWBAL  live  ranges  in  a  conditional  must  differ  from  the  registers  assigned  to 
WEN  and  LOEX  live  ranges.  This  is  all  summarized  in  the  next  theorem. 
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Theorem  26:  Given  a  conditional  that  contains  only  LOEN,  LOEX  and  BGLOBAL  live  ranges 
and  the  corresponding  register  conflict  graph  G,  an  optimal  coloring  for  G  can  be  found  in 
polynomial  time. 

Proof:  A  coloring  for  the  LOEN  and  LOEX  live  ranges  can  be  found  by  algorithm  5-19.  After 
application  of  algorithm  5-19,  a  new  color  is  chosen  for  every  BGLOBAL  live  range,  and  the 
overall  coloring  is  optimal  -  q.e.d. 

Note  that  the  bipartite  matching  algorithm  can  no  longer  be  used  to  find  an  optimal  coloring  in  the 
presence  of  BLOCAL  live  ranges  because  in  each  branch  clause  BLOCALS  can  be  adjacent  to  an  arbitrary 
set  of  LOEN  and  LOEX  live  ranges.  It  is  not  possible  to  incorporate  dependencies  of  BLOCALS  into  the 
bipartite  graph.  We  will  see  in  the  next  section  that  under  certain  restrictions  for  LOEN  and  LOEX  live 
ranges  it  is  possible  to  color  register  conflict  graphs  for  conditional  tranches  optimally  in  polynomial  time, 
even  if  a  conditional  contains  a  combination  of  BLOCAL  and  LOEN  and  LOEX  live  ranges. 

5.4.4.  Mixing  LOEN  and  LOEX  live  ranges  with  BLOCAL  live  ranges 

We  have  seen  that  register  conflict  graphs  for  straight  line  code  are  interval  graphs.  Interval  graphs  are 
contained  in  the  class  of  chordal  graphs. 

Definition  27:  (Chord)  Given  a  cycle  u>  a  graph  {cj,...^},  a  chord  in  the  cycle  is  an  edge 
between  two  non-consecutive  members  of  the  cycle. 

An  example  of  a  cycle  and  chords  in  the  cycle  is  shown  in  Figure  5-20.  The  cycle  consists  of  nodes 
{ 1^23,43.6,7,8} ;  the  chords  are  the  dashed  edges  <2,7>  and  <1 3>. 


Figure  5-20:  Chords  in  a  cycle 


Chordal  graphs  are  graphs  that  are  "triangularizable",  more  formally: 

Definition  28:  (Chordal  graph)  A  graph  G  is  chordal  iff  every  cycle  in  G  that  consists  of  more 
than  3  nodes  has  a  chord. 

Graphs  G1  and  G2  depicted  in  Figure  5-21  are  chordal  because  G1  and  G2  contain  no  chordless  cycle 
larger  than  4.  Note  that  graph  G1  consists  of  a  collection  of  "triangles".  Graph  G3  is  derived  from  G1  by 
adding  an  edge  between  nodes  7  and  2,  and  is  no  longer  chordal  because  cycles  { 1,2, 7 ,8}  and  {23,4,5,6,7} 
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G1 


G2 


G3 

Figure  5-21:  Examples  of  chordal  and  nonchordal  graphs 
are  chordless.  A  nice  property  of  chordal  graphs  is  they  can  be  colored  optimally  in  polynomial  time 
[Gavril  72], 

We  will  show  that  under  certain  restrictions  for  LOEN,  BGLOBAL  and  LOEX  live  ranges,  register  conflict 
graphs  for  conditionals  are  chordal.  Before  we  formalize  the  restrictions  in  a  theorem,  we  give  some 
intuitive  reasons  for  chordality  of  register  conflict  graphs. 

Given  a  conditional  branch  with  n  branch  clauses,  let  Gc  be  the  register  conflict  graph  for  the  *-th  branch 
clause.  It  is  easy  to  see  that  if  the  register  conflict  graphs  of  two  distinct  branch  clauses  have  nodes  in 
common,  those  nodes  must  be  BGLOBAL ,  LOEN  or  LOEX  live  ranges.  This  is  illustrated  in  Figure  5-22. 

We  see  a  conditional  construct  consisting  of  split  and  join  ncde  and  two  branch  clauses.  The  first  branch 
clause  consists  of  basic  blocks  1,2  and  3,  the  second  branch  clause  consists  of  basic  blocks  4,5,6  and  7. 
The  branch  contains  live  ranges  for  6  variables.  Since  each  variable  has  one  unique  live  range,  we  name 
the  live  ranges  after  the  variables.  Live  range  n  is  a  LOEN  live  range,  and  live  range  x  is  LOEX.  The  left 
branch  clause  contains  BLOCAL  live  range  d,  the  right  branch  clause  contains  BLOCAL  live  ranges  a,b  and 
c. 


Figure  5-23  shows  the  register  conflict  graphs  for  the  first  branch  clause,  labeled  Gl,  the  second  branch 
clause,  labeled  G2  and  the  register  conflict  graph  of  the  entire  conditional  branch. 
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Figure  5-22:  Conditional  branch 


Figure  5-23:  Nonchordal  register  conflict  graphs 

Note  that  live  ranges  x  and  n  occur  in  both  G1  and  G2,  but  that  in  the  first  branch  clause  x  and  n  co-exist: 
this  is  not  the  case  in  the  second  branch  clause.  Hence,  in  G1  there  is  an  edge  between  n  and  x,  but  not  in 
G2.  The  register  conflict  graph  of  the  entire  branch  construct  contains  an  edge  between  n  and  x,  and  the 
graph  contains  a  chordless  cycle  consisting  of  nodes  [aJ>,cj tjc) . 
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Figure  5-24  shows  the  same  conditional  branch  construct  as  in  Figure  5-22,  only  now  the  live  ranges  n 
and  x  overlap  in  both  the  first  branch  clause  and  the  second  branch  clause. 


split 


5 


6 


7 


Figure  5-24:  Conditional  branch  construct  with  chordal  register  conflict  graph 

This  changes  both  the  register  conflict  graph  for  the  second  branch  clause,  labeled  G2  and  the  register 
conflict  graph  for  the  entire  conditional  branch  as  shown  in  Figure  5-25. 

Note  that  the  register  conflict  graph  for  the  entire  branch  construct  contains  no  longer  a  chordless  cycle 
that  is  larger  than  3.  In  other  words,  the  register  conflict  graph  is  chordal.  We  formalize  -ie  observations 
about  chordality  of  register  conflict  graphs  for  conditionals  in  a  sequence  of  lemmas. 

We  now  analyze  types  of  live  ranges  that  can  be  part  of  chordless  cycles  larger  than  4.  First,  we  show 
that  there  must  be  at  least  one  pair  of  non-overlapping  LOEN/LC  EX  live  ranges  if  a  chordless  cycle  of  size 
larger  than  4  consists  only  of  BGLOBAL  nodes. 

Lemma  29:  Given  a  register  conflict  graph  G  of  a  non-nested  conditional  that  contains  no 
broken  live  ranges,  let  {cj, ....cj  be  a  chordless  cycle  in  G  of  4  or  more  nodes,  such  that  ci  is 
either  a  BGLOBAL,  a  LOEN  or  a  LOEX  live  range  for  i e  {l,...,/i}.  Then  there  must  be  at  least 
one  LOEN  live  range  ci  and  one  LOEX  live  range  Cj  such  that  Cj  is  not  in  c- s  overlap  set. 

Proof:  Suppose  that  the  overlap  set  of  every  LOEN  live  range  e  [cv...£n)  includes  every 
LOEX  live  range  e  [cv...^H).  Then  there  is  an  edge  between  every  LOEN  live  range 
€  {cj,...,cn}  and  every  LOEX  live  range  in  e  {cj,...,c„}.  Further,  the  LOEN  live  ranges  form  a 
clique  as  do  the  LOEX  live  ranges.  By  definition,  all  BGLOBAL  live  ranges  overlap  with  every 
member  in  the  cycle.  Hence,  the  nodes  in  {cj,...,^}  form  a  clique  and  therefore  cannot  form  a 
chordless  cycle  that  is  larger  than  3.  Hence,  there  must  be  at  least  one  LOEN  live  range  ci  and 
one  LOEX  life  range  c;  such  that  Cj  is  not  in  c/s  overlap  set  -  q.e.d. 
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Figure  5-25:  Chordal  register  conflict  graphs 


A  chordless  cycle  of  size  4  or  larger  can  not  consist  of  live  ranges  that  are  BLOCAL,  stated  in  the  next 
lemma. 

Lemma  30:  Given  a  register  conflict  graph  G  of  a  non-nested  conditional  that  contains  no 
broken  live  ranges,  let  be  a  chordless  cycle  in  G  of  4  or  more  nodes.  Then  there  must 

be  at  least  two  nodes  in  the  cycle  that  are  not  BLOCAL  in  every  branch. 

Proof:  Suppose  that  all  nodes  in  {cj,...,c„}  are  BLOCAL.  By  definition,  BLOCAL  live  ranges 
contain  neither  a  split  node  nor  a  join  node.  Consequently,  any  BLOCAL  live  ranges  that  are 
connected  must  be  part  of  straight  line  code  that  contains  no  split  or  join  nodes.  Register  conflict 
graphs  of  straight  line  code  are  interval  graphs  and  hence  can  not  contain  a  chordless  cycle  that  is 
larger  than  3. 

There  must  be  at  least  two  nodes  in  the  cycle  that  rre  not  BLOCAL,  for  suppose  there  is  only 
one  node  c  in  {c,,...,cB}  that  is  not  BLOCAL.  Because  there  exist  no  edges  between  BLOCAL 
nodes  It  'afferent  branch  clauses,  there  must  be  a  chordless  cycle  of  size  4  or  larger  that  consists 
of  cg  ano  nodes  that  are  BLOCAL  and  all  part  of  the  same  branch  clause.  This  is  not  possible, 
because  the  subgraph  formed  by  cg  and  BLOCAL  live  ranges  of  one  branch  clause  is  an  interval 
graph  and  can  not  contain  a  chordless  cycle  larger  than  3  -  q.e.d. 

Via  Lemmas  29  and  30  we  can  now  show  that  every  chordless  cycle  in  a  register  conflict  graph  of  a 
non-nested  conditional  branch  must  contain  at  least  one  pair  LOEN/LOEX  live  ranges  x  and  y  for  which 
there  is  a  branch  clause  in  which  x  and  y  do  not  co-exist 

Theorem  31:  Given  a  register  conflict  graph  G  of  a  non-nested  conditional  branch  that  contains 
no  broken  BGLOBAL  live  ranges,  let  be  a  chordless  cycle  in  G  of  4  or  more  nodes  that 

contains  BLOCAL  live  ranges.  Then  there  must  be  at  least  one  LOEN  live  range  x  and  one  LOEX 
live  range  y  such  that  there  is  at  least  one  branch  clause  in  which  x  and  y  do  not  co-exist 

Proof:  By  Lemma  30  there  must  be  at  least  two  members  of  [cv...jn)  that  are  not  BLOCAL. 

Let  x  and  y  be  two  arbitrary  members  of  the  cycle  that  are  not  BLOCAL. 

If  x  and  y  are  both  continuous  BGLOBAL  live  ranges,  they  co-exist  in  every  branch  clause  by 
definition.  Hence,  x  and  y  must  be  either  LOEN  or  LOEX. 
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We  will  now  assume  that  all  WEN  and  WEX  live  range  overlap  in  every  branch  clause.  Then 
there  can  be  at  most  two  WEN  or  WEX  live  ranges  in  the  cycle,  because  by  our  assumption 
more  than  two  WEN  or  WEX  live  ranges  must  contribute  a  chord  to  the  cycle.  Hence  the  cycle 
contains  and  all  c;e  {c1,...,cfl_2}  must  be  BWCAL  live  ranges.  This  is  depicted 

in  Figure  5-26.  Live  ranges  x  and  y  are  non  -BWCAL,  and  live  ranges  1 . 6  are  BWCAL. 


Figure  5-26:  Chordless  cycle  with  local  live  ranges  1,23,4,5,6 

Because  there  are  no  edges  between  BWCAL  live  ranges  of  distinct  branch  clauses,  all  the  ci  in 
the  cycle  must  be  part  of  the  same  branch  clause.  By  our  assumption,  x  and  y  overlap  in  every 
branch  clause.  Because  no  c,e  {c1,..,cn_2)  contains  a  split  or  join  node,  all  cte  {c1,..^:rt_2}  must 
occur  in  straight  line  code.  But  then  the  cycle  can  not  be  largo-  than  3. 

It  remains  to  be  shown  that  x  and  y  cannot  be  both  WEN  or  both  WEX.  We  will  show  that  for 
the  WEN  case  -  the  WEX  case  is  similar. 

Assume  the  cycle  looks  as  above  (depicted  in  Figure  5-26),  and  that  both  x  and  y  are  WEN  live 
ranges.  We  have  shown  that  all  c;  must  be  in  the  same  branch  clause  and  that  x  and  y  can  not 

overlap  in  the  branch  clause  in  which  live  ranges  Cj . c„_2  occur.  By  definition,  if  x  and  y  do  not 

ovolap  in  the  branch  clause  containing  Cj,...,cn_ 2,  either  x  or  y  can  not  be  adjacent  to  any  of 
c-e  jj  jt-2}'  ®ut  t*iere  0311  on]y  be  a  cycle  if  both  x  and  y  are  adjacent  to  members  of 

(Cl . cl2)  in  the  same  branch  clause  -  a  contradiction.  Hence,  x  and  y  can  not  both  be  WEN. 

By  the  same  arguments,  x  and  y  can  not  both  be  LOEX.  As  a  result,  x  and  y  must  be  a 
WEN/LOEX  pair  -  q.e.d. 

Theorem  31  states  that  if  there  exists  a  chordless  cycle  that  is  larger  than  3  in  a  register  conflict  graph  for 
a  non-nested  conditional  branch  that  contains  no  broken  BGWBAL  live  ranges,  there  must  be  at  least  one 
pair  of  WEN  and  WEX  live  ranges  x  and  y  such  that  there  is  a  branch  clause  in  which  x  and  y  do  not 
co-exist.  It  is  easy  to  check  if  WEN  and  WEX  live  ranges  co-exist  in  every  branch  clause,  and  if  they  all 
do,  we  can  show  that  the  register  conflict  graphs  of  such  branches  are  chordal,  stated  in  the  following 
Lemma: 

Lemma  32:  Let  G  denote  the  register  conflict  graph  of  a  non-nested  conditional  that  contains 
no  broken  live  ranges  and  in  which  all  WEN  variables  and  all  WEX  variables  co-exist  in  every 
branch  clause.  Then  G  is  a  chordal  graph  and  an  optimal  coloring  for  G  can  be  found  in 
polynomial  time. 

Proof:  Since  all  WEN  and  WEX  live  ranges  co-exist  in  every  branch  clause,  the  register 
conflict  graph  can  not  contain  a  chordless  cycle  larger  than  3.  Therefore,  G  is  a  chordal  graph. 

See  [Gavril  72]  for  an  algorithm  that  produces  an  optimal  coloring  for  chordal  graphs  q.e.d. 
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53.  Chapter  summary 

We  have  started  with  the  fact  that  register  conflict  graphs  for  straight  line  code  are  interval  graphs,  for 
which  an  optimal  coloring  can  be  found  in  polynomial  time.  We  showed  how  the  cliques  in  an  interval 
graph  can  be  ordered  to  form  a  sequence  of  cliques,  and  used  the  construction  to  prove  that  the 

standard  node  removal  technique  can  color  interval  graphs  optimally.  The  significance  of  finding  the 
sequence  of  cliques  in  an  interval  graph  is  that  each  one  of  those  cliques  can  be  used  as  a  separator  clique. 

We  examined  the  register  conflict  graphs  for  loops  and  conditionals,  Register  conflict  graphs  for  loops 
with  arbitrary  live  ranges  contain  the  class  of  circular  arc  graphs,  and  are  thus  NP  hard  to  color  optimally. 
If  a  loop  consists  only  of  continuous  backarc  live  ranges,  the  register  conflict  graph  of  the  loop  is 
equivalent  to  the  register  conflict  graph  to  the  flow  graph  derived  from  the  loop  by  removing  the  backarc 
live  range. 

Register  conflict  graphs  for  conditionals  are  more  complex  Wc  'lave  seen  that  optimal  colorings  can  be 

found  for  conflict  graphs  of  conditionals  with  the  following  proparties: 

1.  The  live  ranges  in  the  conditional  are  either  continuous  BGLOBAL  live  ranges  or  BLOCAL 
live  ranges  w.r.L  each  branch  that  contains  them. 

2.  The  conditional  contains  no  BLOCAL  live  ranges. 

3.  Every  pair  of  LOEN  live  range  x  and  LOEX  live  range  y  co-exist  in  every  branch  clause  and 
there  are  no  broken  BGLOBAL  live  ranges. 

The  motivation  for  examining  situations  in  which  loops  can  be  "linearized"  by  removing  the  loop  backarc 
and  conditionals  can  be  "linearized"  or  collapsed  by  ignoring  or  linearizing  individual  branch  clauses  is  that 
these  flow  graph  simplifications  turn  a  complex  flow  graph  into  a  "straightened  out"  flow  graph  that 
contains  more  straight  line  code  sequences.  Hence,  more  sequences  more  separator  cliques  can  be  found  in 
the  register  conflict  graphs.  Each  of  those  simplifications  can  only  be  carried  out  when  the  live  ranges  of 
the  loop/conditional  are  "well  behaved"  and  form  a  nice  register  conflict  graph. 

When  these  simplifications  can  not  be  carried  out,  it  is  possible  that  the  register  conflict  graph  is  not  an 
interval  graph.  In  the  next  chapter  we  will  discuss  a  technique  called  "node  merging"  with  which  a 
non-interval  register  conflict  graph  can  be  changed  into  an  interval  register  conflict  graph  in  some  cases. 
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Chapter  6 

Transformations  on  register  conflict  graphs 


In  the  previous  chapter  we  described  situations  that  allow  to  simplify  the  flow  graph  for  the  purpose  of 
register  allocation  without  altering  the  register  conflict  graph.  The  flow  graph  can  be  simplified  by 
removing  backarcs  from  loops  and  sequentializing  or  collapsing  conditionals.  The  purpose  of 
transformations  on  flow  graphs  is  to  detect  portions  of  the  register  conflict  graphs  that  are  interval  graphs  or 
chordal  graphs.  If  those  portions  are  known,  it  is  easy  to  partition  program  flow  graphs  into  clique 
connected  components  -  the  basis  for  parallelizing  global  register  allocation. 

In  this  chapter,  we  deal  with  loops  and  conditionals  that  can  not  be  simplified  to  equivalent  straight  line 
code.  For  such  programs,  register  conflict  graphs  are  usually  not  interval  graphs.  Our  approach  to  register 
allocation  is  based  on  mapping  non-interval  register  conflict  graphs  to  interval  register  conflict  graphs. 
Interval  register  conflict  graphs  are  desirable  for  two  reasons.  First,  they  can  be  colored  optimally  in 
polynomial  time.  Second,  clique  separators  are  easy  to  locate  in  interval  graphs.  Clique  separated  portions 
of  a  register  conflict  graph  can  be  colored  individually  and  combined  to  an  overall  coloring  by  renaming 
only. 

In  our  model,  there  are  two  sources  of  non-interval  register  conflict  graphs:  broken  live  ranges  and 
unrestricted  combinations  of  LOEN  and  LOEX  live  ranges  in  conditionals.  If  holes  in  live  ranges  can  be 
"stuffed"  with  live  ranges  that  "fit"  into  the  holes,  a  broken  live  range  can  be  transformed  into  a  continuous 
one.  We  demonstrate  situations  in  which  broken  live  ranges  can  be  transformed  into  continuous  live  ranges 
by  a  technique  we  call  node  merging.  Node  merging  can  also  be  applied  to  the  nodes  of  the  register 
conflict  graphs  of  different  branch  clauses  of  the  same  conditional:  given  a  conditional  C  with  branch 
clauses  c  and  c\  it  sometimes  is  possible  to  encode  the  register  conflict  graph  of  branch  clause  c  into  the 
register  conflict  graph  of  a  different  branch  clause  c' .  A  coloring  of  the  register  conflict  graph  for  c'  can  be 
mapped  directly  onto  the  register  conflict  graph  for  c  -  hence,  during  register  allocation  it  suffices  to  color 
only  the  graph  for  c'.  If  the  conflict  graphs  for  all  branch  clauses  can  be  "reduced"  to  that  of  one  single 
branch  clause,  the  resulting  conflict  graph  is  much  simpler  and  in  some  cases  can  even  be  colored  optimally 
in  polynomial  time. 

We  first  show  how  node  merging  can  be  applied  to  eliminate  holes  from  broken  live  ranges.  The  second 
part  of  this  chapter  is  devoted  to  the  application  of  node  merging  to  simplify  register  conflict  graphs  of 
conditionals. 
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6.1.  The  effect  of  holes  in  broken  live  ranges 

Eliminating  holes  from  broken  live  ranges  means  eliminating  one  source  for  non-interval  register  conflict 
graphs.  The  easiest  way  to  eliminate  a  hole  h  from  a  broken  live  range  v  is  to  add  all  the  blocks  that  form  h 
to  v.  There  are  cases  in  which  this  increases  the  chromatic  number  of  the  register  conflict  graph 
unnecessarily,  an  example  of  which  is  given  in  Figure  6-1.  The  left  side  of  the  figure  shows  a  broken  and  a 


0  0  0  0 
Original  conflict  graph  Conflict  0raPh  ^  Paddin8 

Figure  6-1:  Padding  a  hole  of  a  broken  live  range 

continuous  live  range.  Live  range  a  consists  of  basic  blocks  { 1 ,2,7,8)  and  contains  a  hole  consisting  of 
{3,4 ,5,6} ,  live  range  d  consists  of  basic  blocks  {3, 4,5, 6).  The  register  conflict  graph  is  shown  to  the  right  - 
there  is  no  edge  between  the  nodes  for  a  and  d  and  the  conflict  graph  is  1-colorable.  If  the  live  range  of  a  is 
padded  to  contain  the  basic  blocks  that  form  the  hole  (3,43,6),  the  node  for  d  can  no  longer  be  colored 
with  the  same  color  as  the  node  for  a  -  padding  a  causes  an  edge  between  nodes  a  and  d,  depicted  on  the 
right  side  of  Figure  6-1  -  the  resulting  conflict  graph  is  no  longer  1-colorable.  The  chromatic  number  of  the 
register  conflict  graph  increases  because  d  is  a  live  range  that  consists  only  of  basic  blocks  that  "fit"  into  the 
hole  of  live  range  a.  Hence,  padding  a  with  the  basic  blocks  that  form  the  hole  excludes  the  color  used  for 
a  from  the  colors  that  can  be  used  for  d.  Had  there  been  no  live  range  that  fits  into  a’s  hole,  padding  a  with 
the  blocks  that  form  the  hole  would  have  not  altered  the  register  conflict  graph.  This  is  formalized  in  the 
next  lemma. 

Lemma  1:  Given  a  flow  graph  with  basic  blocks  B,  live  ranges  L  and  register  conflict  graph  G, 
let  /  be  a  broken  live  range  with  hole  h.  If  there  is  no  live  range  range  v  such  that  v  consists  of  a 
subset  of  the  basic  blocks  that  form  h,  then  adding  h  to  /  does  not  change  G. 

Proof:  Let  G'  be  derived  from  G  by  adding  h  to  /.  We  prove  by  contradiction  that  G  and  G‘ 
must  be  equal,  so  we  assume  that  G  and  G’  differ.  Because  the  set  of  nodes  in  G  is  equal  to  the 
set  of  nodes  in  G'  except  for  /,  the  set  of  edges  not  incident  or  l  must  be  equal  as  well.  Clearly, 
the  number  of  edges  incident  on  /  in  G  must  be  less  than  or  equal  than  the  number  of  edges 
incident  on  (= l<uh  because  /  consists  of  a  subset  of  Suppose  there  is  a  node  v  such  that  there 
is  an  edge  between  v  and  /’  in  G'  but  no  edge  between  v  and  l  in  G.  Then,  v  must  contain  basic 
blocks  that  occur  in  h.  By  prerequisite,  v  must  then  also  contain  basic  blocks  that  occur  in  l,  so 
there  must  be  an  edge  between  v  and  l  in  G  -  a  contradiction. 

As  a  consequence  of  Lemma  1,  only  a  hole  h  for  which  there  exists  live  ranges  that  "fits"  into  h  can  cause  a 
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non-interval  register  conflict  graph.  An  example  of  a  such  non-interval  register  conflict  graph  is  given  in 
Figure  t>-2.  The  live  range  for  a  is  broken  and  contains  hole  {3, 4, 5, 6}.  The  live  range  for  c  consists  of 


Figure  6-2:  A  broken  live  range  and  a  non-interval  register  conflict  graph 
basic  blocks  {3,44,6} ,  and  hence  is  a  fit  for  the  hole  of  the  live  range  for  a.  The  register  conflict  graph  is 
shown  to  the  right  -  it  consists  of  a  chordless  cycle  of  size  4  and  is  therefore  not  an  interval  graph. 


We  will  show  that  certain  types  of  live  ranges  that  fit  into  a  hole  of  a  broken  live  range  can  be  used  for 
padding  that  hole  without  penalty  with  regard  to  the  chromatic  number  of  the  conflict  graph.  The  definition 
of  such  live  ranges  is  given  below. 

Definition  2:  (Fit  for  a  hole)  Given  a  flow  graph  consisting  of  basic  blocks  B  with  register 
conflict  graph  G,  let  /  be  a  broken  live  range  such  that  A  is  a  hole  of  l.  We  say  that  a  live  range  p 
is  a  fit  for  h  iff  p  is  continuous  and  pn(B  -h)=0. 

Figure  6-3  shows  broken  live  range  a  with  hole  h=  {3,4, 5, 6}  and  a  few  examples  of  live  ranges,  some  of 
which  fit  into  h.  The  live  range  &  is  a  fit  for  h,  because  b  is  continuous  and  consists  of  basic  blocks  that  are 
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Figure  6-3:  Fits  for  a  hole 

all  contained  in  h.  Because  c  is  not  a  continuous  live  range,  it  is  no  fit  for  h,  even  though  the  basic  blocks 
that  form  c,  {3,5,6}  are  a  subset  of  the  basic  blocks  that  form  the  hole.  Because  d  contains  basic  blocks 
outside  the  hole,  d  is  not  a  fit  for  h. 
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6.2.  Eliminating  holes  via  node  merging 

Node  merging  is  a  technique  to  enforce  that  two  non-adjacent  nodes  in  the  register  conflict  graph  are 
colored  identically  before  an  actual  coloring  is  produced.  Given  a  conflict  graph  G  with  two  nodes  and 
v2,  Vj  and  v2  are  merged  by  eliminating  node  v;  from  G  and  adding  all  edges  incident  on  v;  to  the  edges 
incident  on  v2  Under  certain  restrictions  we  can  ensure  that  node  merging  changes  a  non-interval  register 
conflict  graph  into  an  interval  graph. 

Node  merging  can  be  used  to  eliminate  holes  from  broken  live  ranges.  Given  a  broken  live  range  l  with  a 
hole  h  and  another  live  range  p  that  is  a  fit  for  the  hole  h,  there  can  not  be  an  edge  between  l  and  p.  Hence, 
it  is  possible  to  color  /  and  p  with  identical  colors.  Chosing  the  same  color  for  two  separate  live  ranges  can 
be  viewed  as  "melting”  the  two  live  ranges  into  one  -  the  live  range  /  is  changed  into  /’  by  adding  the  basic 
blocks  contained  in  p.  If  padding  /  with  p  results  in  continuity  of  there  is  one  less  broken  live  range  to 
worry  about  We  will  discuss  cases  in  which  there  are  several  fits  for  the  same  hole  or  in  which  several 
merge  operations  are  necessary  to  eliminate  one  hole.  We  use  node  merging  only  if  we  can  ensure  that  the 
chromatic  number  of  G'  derived  from  the  original  conflict  graph  G  by  live  range  merging  is  not  larger  than 
the  chromatic  number  of  G.  Note  that  u  coloring  for  the  conflict  graph  derived  by  node  merging  can  be 
directly  mapped  to  the  original  conflict  graph.  In  the  next  paragraphs  we  introduce  restrictions  that  must  be 
met  by  live  ranges  that  fit  into  a  hole  such  that  the  chromatic  number  of  the  conflict  graph  derived  by  node 
merging  is  equal  to  the  chromatic  number  of  the  original  conflict  graph. 

6.2.1.  Perfect  matches  for  holes 

Figure  64  gives  ar  example  of  a  conflict  graph  that  contains  a  broken  live  range  a  and  two  fits  for  its 
hole,  c  and  d  respectively.  The  original  non-interval  register  conflict  graph  is  shown  top  right.  The  register 
conflict  graph  aerived  by  merging  a  and  d  is  depicted  bottom  left.  It  is  an  interval  graph,  and  contains  two 
3-cliques.  Henc~,  3  colors  are  needed  to  color  the  register  conflict  graph  when  d  is  merged  with  a. 
Merging  a  with  c  yields  the  interval  graph  depk  ted  bottom  right  It  is  easy  to  see  that  2  colors  are 
sufficient  color  that  graph.  This  example  Ulus'  rates  that  deciding  which  live  range  to  use  for  merging 
with  a  broke*.  ‘;ve  range  can  be  nontrivial. 

In  some  case;  tlvi  choice  ef  the  candidate  for  gmg  with  a  broken  live  range  is  irrelevant,  depicted  in 
Figure  6-5.  The  broken  hve  range  a  contains  a  nole  consisting  of  {3, 4 ,5, 6}.  Both  c  and  d  can  be  used  to 
pad  the  hole.  The  original  register  comtict  graph  is  shown  top  right  merging  a  with  c  and  merging  a  with 
d  yields  the  two  register  conflict  graphs  depicted  at  the  bottom  of  the  figure  -  they  are  identical.  Note  also 
that  all  three  conflict  graphs  are  3 -colorable,  in  other  words,  no  merge  operation  causes  an  increase  of  the 
chromatic  number  of  the  drived  gi.phs.  v  — ason  why  the  choice  of  c  or  d  for  merging  with  a  is 
irrelevant  is  that  the  live  raises  that  *  ap  ?  idt  c  and  d  form  a  clique  in  the  register  conflict  graph, 
consisting  of  nodes  [c4  b,e)  We  call  nodes  like  c  and  d perfect  matches  for  the  hole  in  a. 

A  live  range  that  fits  into  a  hole  h  may  itself  contain  a  hole  h^.  Therefore,  live  ranges  that  fit  into  fit 
into  h  as  well.  In  that  situation,  the  sequence  in  which  h  and  are  padded  can  determine  whether  or  not  h 
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Marging  a  and  d  Merging  a  and  c 

Figure  6-4:  Different  conflict  graphs  for  different  merge  operations 
or  hfl  can  be  eliminated  via  node  merging.  Our  goal  is  to  keep  node  merging  to  eliminate  holes  simple,  so 
we  avoid  this  problem  by  requiring  continuity  of  a  perfect  match  for  a  hole.  To  ensure  that  interactions  of 
perfect  matches  with  other  live  ranges  remain  "local"  to  the  hole,  we  enforce  that  every  live  range  that 
overlaps  with  a  perfect  match  but  not  with  the  broken  live  range  must  also  be  a  fit  for  the  hole.  To  avoid 
backtracking  when  a  fit  for  a  hole  is  chosen  for  node  merging,  we  require  that  the  adjacency  lists  of  two 
overlapping  fits  for  a  hole  are  identical.  After  this  motivation,  we  can  define  perfect  matches  formally.  In 
the  following,  let  adjim)  denote  the  set  of  live  ranges  that  overlap  with  a  live  range  m. 

Definition  3:  ( Perfect  match  for  a  hole)  Given  a  register  conflict  graph  G,  we  say  that  a  live 
range  m  is  a  perfect  match  for  a  hole  h  in  a  broken  live  range  /  iff 

1.  m  is  a  fit  for  h 

2.  if  xe  adj(m)  -  {adj(f)nadj(jn)}  then  x  is  a  fit  for  k 

3.  if  xe  adj(m)  and*  is  a  fit  for  h  then  [adj{x) -m)  =  { adj(m)-x'i. 

Examples  of  perfect  matches  are  given  in  Figure  6-6.  Live  range  a  is  Ho?  ist d  contains  a  hslc 
h=  {3,4,5,6,7}.  Because  b,c  and  d  form  a  clique,  and  both  b  and  c  are  ins  ..nr  i  ooth  t>  and  c  art  lie  tea 
matches  for  h  -  for  the  same  reason,  e  is  a  perfect  maxh  for  h 


Perfect  matches  have  a  number  of  desirable  properties  that  to  tii.,:i'is?e  hot's  from  1  ’okon  tive 
ranges  without  paying  the  penalty  of  needing  additional  colors  to  mi  .*  th;  drrfvwJ  conflict  \  tr  fue  first 

property  of  a  perfect  match  p  for  a  hole  h  of  a  broken  live  i  i-  it  p  a  i-1 1  can  be  ’d  while 

maintaining  the  chromatic  number  of  the  original  conflict  graph  *"  Ve  de.?ve  I  r  inflict  ^laph. 

Lemma  4:  Given  a  register  conflict  graph  G  with  broken  live  range  l  with  hole  n  nad  a  live 
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Figure  6-5:  Equal  register  conflict  graphs  for  different  merge  operations 
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Figure  6-6:  Perfect  matches  for  a  hole 

range  p  such  that  p  is  a  perfect  match  for  h,  let  G’  be  the  conflict  graph  derived  from  G  by 
merging  the  nodes  l  and  p.  Then,  the  chromatic  number  of  G'  is  equal  to  the  chromatic  number 
of  G. 

Proof:  We  show  that  the  chromatic  number  of  the  two  graphs  are  the  same  because  a  coloring 
for  G’  can  be  mapped  to  G  and  a  coloring  of  G  can  be  used  to  produce  a  valid  coloring  of  G' 
without  adding  any  colors. 


Case  1:  mapping  a  cobring  of  G’  onto  G:  Let  pul  denote  the  node  derived  by  merging  live 
ranges  p  and  l.  A  valid  coloring  for  G’  can  be  expanded  into  a  valid  coloring  for  G  by  using  the 
color  used  for  pul  in  G'  for  both  p  and  /  in  G  and  by  copying  the  colors  for  all  other  nodes  in  G’ 
to  the  corresponding  nodes  in  G. 
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Case  2:  mapping  a  coloring  for  G  onto  G' :  Let  c7  denote  the  color  used  for  a  live  range  x  in  a 
coloring  of  G.  To  map  a  coloring  of  G  to  G',  we  first  change  the  coloring  of  G  such  that  p  is 
colored  with  ct.  If  there  is  a  live  range  xe  [adj(p)nadj(t)),lhen  cxe  (c/tcp),  so  altering  the  color 
for  p  to  has  no  influence  on  x.  Given  a  live  range  xe  { adj(p)-  {adj(p)rvidj(f)} }  such  that 
cz=Cj,  coloring  p  with  ct  is  not  possible  without  changing  the  color  for  x.  By  prerequisite 
{adj(x)-p)  =  { adj(p)-x }  .  Therefore,  y*pe  adj(x)->cy*c„.  Hence,  x  can  be  colored  with  cp. 
Therefore,  the  color  for  p  can  be  changed  to  ct  and  the  coloring  of  the  nodes  in  adj(p)  can  be 
adjusted  by  exchanging  ct  and  cp  -  in  other  words,  the  number  of  colors  is  not  increased  by 
coloring  p  and  /  identically.  Once  l  and  p  are  colored  identically,  merging  nodes  /  and  p  can  be 
performed  without  adjusting  coloring  -  therefore,  the  chromatic  numbers  of  G  and  G’  must  be 
equal. 

Given  a  perfect  match  p  fora  hole  A  of  a  broken  live  range  l  and  a  live  range  x  that  is  a  fit  for  h  such  that 
xe  adj(p),  it  is  easy  to  see  that  x  is  also  a  perfect  match  for  h  and  that  it  is  irrelevant  whether  x  or  p  is 
chosen  for  merging  with  l  -  in  the  example  shown  in  Figure  6-5,  the  derived  register  conflict  graphs  for 
merging  c  with  /  and  d  with  /  are  equivalent 


The  next  property  of  perfect  matches  is  that  the  order  in  which  perfect  matches  of  the  same  hole  h  are 
merged  is  irrelevant  -  the  remaining  perfect  matches  are  perfect  matches  of  the  holes  of  the  newly  derived 
merged  live  range.  Before  we  formalize  this  in  the  next  lemma,  we  go  through  an  example.  Figure  6-7 
shows  a  broken  live  range  a  with  perfect  matches  b,  c  and  e.  Because  the  adjacency  set  of  b  and  c  are 


Figure  6-7:  Sequences  of  perfect  matches 

equal,  it  is  irrelevant  which  of  the  two  is  chosen  to  be  merged  with  a,  so  we  chose  b.  The  picture  to  the 
right  shows  the  same  set  of  live  ranges,  except  a  and  b  have  been  merged  into  one  live  range  now.  Because 
c  was  adjacent  to  b  before,  c  is  now  adjacent  to  the  newly  derived  live  range  oub  and  can  no  longer  be  a 
perfect  match.  The  original  hole  of  live  range  a  has  now  shrunk  to  the  basic  blocks  {6,7},  and  e  is  still  a 
perfect  match  for  the  newly  derived  hole. 

Lemma  5:  Given  a  flow  graph  with  basic  blocks  B  and  register  conflict  graph  G  with  broken 
live  range  l  with  hole  h  and  liye  ranges  p  and  p’  that  are  both  perfect  matches  for  h,  let  /up  be  the 
live  range  derived  from  merging  /  and  p.  If  p'n(/up)=0,  then  p’  is  a  perfect  match  for  the 
remaining  part  of  the  hole  ti^hoi  the  live  range  derived  by  merging  /  andp. 

Proof:  To  show  that  p’  is  a  perfect  match  for  /up,  we  first  must  show  that  p'  is  a  fit  for  a  hole 
in  live  range  /up.  Because  p’  is  a  perfect  match  for  h,  p'nB  -h=0.  Because  p'n(/up) = 0,  /up 
must  have  a  hole  h’  such  that p'  is  a  fit  for  h’.  A  live  range  xe  {adj(p')-adj(lvp))  must  be  a  fit 
for  h’  as  well,  because  by  prerequisite  x  is  continuous. 
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It  remains  to  be  shown  that  [adjip'^-x)  -  [adj(x)-p'}Vx£  adjip') such  that* fits  h’ .  A  live 
range  xe  [adjip^-adji 0}  can  not  be  adjacent  to  p,  because  all  fits  of  h  adjacent  to  p  share  the 
same  adjacency  list,  but  p'  is  not  adjacent  to  p  by  prerequisite.  Therefore,  either 
x  e  adj([)  orxe  adjip").  Thus  merging  p  and  l  does  not  affect  the  fits  of  h’  that  are  adjacent  to  p' , 
and  by  prerequisite  [adjip^-x]  =  { adj(x ) -p') .  Hence,  p'  is  a  perfect  match  for  h' . 

Lemma  5  states  that  the  order  in  which  which  perfect  matches  are  merged  wiin  a  broken  live  range  does 
not  matter  at  all  -  the  graph  derived  by  node  merging  with  perfect  matches  is  identical  for  all  choices  of 
perfect  matches.  In  other  words,  no  search  is  necessary  to  choose  a  live  range  for  node  merging;  node 
merging  is  kept  simple  and  inexpensive.  Once  all  perfect  matches  are  used  up  and  the  hole  is  filled,  the 
derived  conflict  graph  is  still  equivalent  to  the  original  conflict  graph. 


6 22.  Imperfect  matches  and  breaks 

The  restrictions  that  must  be  met  by  perfect  matches  were  chosen  such  that  the  choice  which  perfect 
match  was  merged  with  a  broken  live  range  did  not  matter  at  all.  In  other  words,  merging  of  a  broken  live 
range  with  perfect  matches  for  its  holes  can  be  done  blindly.  If  we  are  willing  to  use  more  expensive 
algorithms  to  choose  live  ranges  for  padding  holes,  we  can  relax  the  restrictions  that  must  be  met  by  fits  for 
a  hole  and  still  guarantee  that  the  chromatic  number  of  the  derived  conflict  graph  does  not  increase. 

The  motivation  for  our  next  definition  is  to  chose  the  "largest"  fit  for  a  hole.  Given  a  hole  h  and  some  fits 
for  h,  an  imperfect  match  m  for  h  is  a  fit  whose  adjacency  list  adj(m)  includes  the  adjacency  lists  of  all  fits 
in  adj{m).  More  formally: 

Definition  6:  ( Imperfect  match  for  a  hole)  Given  a  register  conflict  graph  G,  we  say  that  a  live 
range  m  is  an  imperfect  match  for  a  hole  h  in  a  broken  live  range  /  iff 

1.  m  is  a  fit  for  h 

2.  If  x  e  adj{m)  -  adjif)  then  x  is  a  fit  for  h 

3.  If  p  e  (4r  is  a  fit  for  h)  then  adj(p)nadj( /)  zadj(m)nodj( /) 

An  example  of  an  imperfect  match  is  depicted  in  Figure  6-8.  Live  range  a  contains  a  hole  and  live 
ranges  c,  d  and  e  are  fits  for  that  hole.  The  set  of  nodes  adjacent  to  a  consists  of  { bj ).  Because  c  is 
adjacent  to  both  b  and/,  c  is  an  imperfect  match  for  a’s  hole.  Neither  d  nor  e  are  imperfect  matches  for  a’s 
hole. 

Similar  to  Lemma  4,  we  can  show  that  the  graph  derived  by  merging  a  broken  live  range  with  an 
imperfect  match  has  the  same  chromatic  number  as  the  original  graph. 

Lemma  7:  Given  a  flow  graph  with  basic  blocks  B,  register  conflict  graph  G  and  a  broken  live 
range  l  with  hole  h,  let  p  be  an  imperfect  match  for  h,  and  let  G'  be  the  conflict  graph  derived 
from  G  by  merging  p  and  l.  Then,  die  chromatic  numbers  of  G  and  G'  are  identical. 

Proof:  It  is  easy  to  see  that  any  coloring  for  G’  can  be  expanded  into  a  coloring  for  G  without 
using  additional  colors.  Given  a  coloring  of  G,  let  cx  denote  the  color  for  live  range  x  in  a 
coloring  to  G.  Let  [pv..-j>H]  be  the  live  ranges  that  are  fits  for  h.  If 
xe  adj{p^nadj(t)i e  {l,...,n},  then  cxe  {cp^},  hence  x  is  unaffected  by  changing  the  color  of  p 
to  c,.  Therefore,  the  only  nodes  affected  by  changing  p’s  color  to  ct  are  By 

prerequisite,  no  p-e  [plt...j>n]  is  adjacent  to  /,  therefore  cp  and  ct  can  be  exchanged  among 
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Figure  6-8:  An  imperfect  match  for  a  hole 

{px,...j)n}.  Hence,  a  coloring  of  G  can  be  used  to  produce  a  coloring  of  G'  without  introducing 
new  colors  -  the  chromatic  numbers  of  both  graphs  are  identical. 


Imperfect  matches  do  not  have  the  nice  properties  of  perfect  matches.  For  example,  given  an  imperfect 
match  m  for  a  hole  h  of  a  broken  live  range  l,  merging  m  and  /  might  not  fill  up  h  completely.  It  is  not 
guaranteed  that  the  remaining  hole  can  be  padded  with  the  remaining  live  ranges.  Imperfect  matches  are 
useful  however  if  they  are  "large"  enough  to  fill  the  entire  hole.  In  the  example  depicted  in  Figure  6-8,  the 
imperfect  match  c  fills  the  entire  hole  of  live  range  a.  The  fact  that  c  is  an  imperfect  match  allows  us  to 
deduce  that  by  merging  c  and  a  the  entire  hole  can  be  eliminated  without  increasing  the  number  of  colors 
needed  to  color  the  derived  register  conflict  graph. 


The  reason  why  broken  live  ranges  can  create  register  conflict  graphs  that  contain  chordless  cycles  of  size 
4  or  larger  is  that  all  fits  for  a  hole  h  are  interdependent  -  the  color  chosen  for  a  live  range /  that  is  a  fit  for  a 
hole  h  sometimes  influences  the  choices  of  colors  for  all  other  fits  for  h.  Since  we  are  forced  to  use  the 
same  color  for  the  TOP  and  the  BOT  part  of  a  broken  live  range,  chordless  cycles  are  introduced  in  a 
register  conflict  graph.  The  motivation  for  the  next  definition  is  to  determine  basic  blocks  inside  a  hole  h 
that  partition  the  fits  for  h  into  independent  sets.  If  there  is  such  a  basic  block,  the  fact  that  we  must  color 
TOP  and  BOT  parts  of  a  broken  live  range  identically  has  less  impact. 


So  rather  than  restricting  live  ranges  that  are  fits  for  a  hole,  we  now  define  restrictions  on  the  hole  itself. 
We  will  show  later  that  holes  that  meet  these  restrictions  can  be  padded  by  their  fits  without  compromising 
optimality  of  the  overall  coloring. 


If  there  is  a  basic  block  b  that  is  not  contained  in  any  fit  for  the  hole,  then  b  can  be  used  to  partition  the 
fits  for  the  hole  into  two  sets:  the  fits  that  consist  of  basic  blocks  that  consist  of  basic  blocks  that  precede  b 
and  those  that  consist  of  basic  blocks  that  succeed  b.  If  in  addition  the  only  live  ranges  that  contain  b  must 
also  contain  all  basic  blocks  of  the  hole,  no  dependencies  exist  between  the  fits  in  those  two  sets.  We  will 
show  that  the  absence  of  dependencies  can  be  exploited  to  eliminate  the  hole.  We  first  give  a  formal 
definition  of  a  break  in  a  hole. 

Definition  8:  (Break  in  a  hole)  Given  a  register  conflict  graph  G  with  a  broken  live  range  /  that 
contains  a  straight  line  hole  h=  (/ij,...,/in),  we  say  that  A(€  {hv...Jin}  is  a  break  for  h  iff  every 
live  mge  that  contains  hi  contains  {/i1(. i.e.  all  basic  blocks  that  form  the  hole. 
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Figure  6-9  gives  an  example  of  a  break  in  a  hole.  Live  range  a  contains  a  hole  consisting  of  basic  blocks 


Figure  6-9:  Break  of  a  hole 

{3,4,5,6,7}.  Basic  block  6  is  a  break  for  that  hole  -  the  only  live  range  that  contains  6  is/  which  contains 
basic  blocks  of  both  the  TOP  and  the  BOT  part  of  a  -  {1,2}  and  {8}  respectively.  The  set  of  fits  for  the 
hole  consists  of  {bfi4je}.  The  break  partitions  the  live  ranges  that  are  fits  for  the  hole  into  two 
independent  sets  -  one  "above"  the  break,  {b,c4}  and  one  "below"  the  break,  {e}. 

A  break  in  a  hole  is  desirable;  the  intuitive  reason  is  that  the  live  ranges  that  are  fits  for  that  hole  can  be 
partitioned  into  two  independent  subsets  via  the  basic  block  that  is  the  break.  The  live  ranges  in  one  of  the 
subsets  can  only  be  adjacent  to  the  TOP  or  BOT  part  of  the  broken  live  range,  but  not  both.  Therefore,  both 
parts  can  be  colored  independently  and  combined  by  renaming.  This  Is  stated  in  the  next  lemma. 

Lemma  9:  Given  a  flow  graph  F  with  register  conflict  graph  G  such  that  there  exists  a 
backarcfree  path  that  contains  all  basic  blocks  in  F,  let  /  be  a  broken  live  range  that  contains  one 
hole  h  with  break  bt  If  all  live  ranges  in  G  except  /  are  continuous,  G  can  be  colored  optimally 
in  polynomial  time. 

The  proof  of  Lemma  9  consists  of  three  parts.  In  the  first  part  we  show  that  the  flow  graph  F  can  be 
simplified  to  a  flow  graph  F  that  is  equivalent  for  the  purpose  of  register  allocation.  In  the  second  part  of 
the  proof,  we  consider  only  the  portion  of  F'  that  consists  of  the  loop  or  conditional  that  causes  the  hole, 
called  F" ,  and  show  how  to  color  the  conflict  graph  G”  for  F”  optimally  in  polynomial  time.  The  third 
part  of  the  proof  deals  with  extending  the  coloring  for  G”  to  a  coloring  for  the  original  register  conflict 
graph.  As  we  go  through  the  proof,  we  will  illustrate  our  steps  by  examples.  During  the  proof,  we  assume 
that  the  hole  containing  the  break  is  caused  by  a  loop.  The  case  in  which  the  hole  is  caused  by  a 
vonditional  can  be  shown  analogously. 

Proof:  Part  1 :  constructing  F’:  WJ.o.g.  assume  that  the  hole  h  is  caused  by  a  loop  L  with  loop 
head  a  and  loop  exit  e.  Let  F'  be  the  graph  derived  from  F  by  removing  the  following  edges: 

1 .  all  backarcs  except  that  of  L 

2.  all  edges  from  the  split  node  of  a  conditional  C  to  the  join  node  of  C. 

By  Lemma  15  the  register  conflict  graph  of  F  is  equal  to  the  original  register  conflict  graph  G. 

Part  2:  constructing  G':  Let  bj,...Jbn  be  the  set  of  basic  blocks  that  form  the  hole  h,  and  let  Z7” 
be  the  subgraph  induced  by  {b0Jb{,...JbnJ>n+l},  where  b0  is  b/s  predecessor  in  F',  and  bn+]  is 
bn's  successor  in  F’.  Because  l  is  the  only  broken  live  range,  bke  {by,...,bn},  and  both 
bgandb^j  el.  G’  consists  of  those  live  ranges  in  F’  that  do  not  contain  the  break  bk,  in  other 
words  live  ranges  that  contain  a  subset  of  the  following  basic  blocks:  •'WiMM- 
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The  construction  of  G'  is  depicted  in  Figure  6-10.  On  the  left  side  of  that  figure  the  original 
conflict  graph  G  is  shown.  The  broken  live  range  is  a,  and  the  "straight  line"  hole  of  a  consists  of 
basic  blocks  {5,6, 7,8,9}.  Basic  block  8  forms  a  break  in  the  confli'  Taph.  G’  is  shown  on  the 
right  side  of  Figure  6-10  -  a  now  consist  only  of  basic  blocks  {4,10} ,  me  two  blocks  that  "define” 
the  hole.  Only  live  ranges  that  either  fit  into  the  hole  or  that  contain  either  4  or  10  are  included  in 
G\ 


Figure  6-10:  G  and  G’ 


Because  bk  is  a  break  in  h,  no  live  range  in  G’  contains  b^  Then,  G’  can  be  partitioned  into  two 
subgraphs,  Gi  induced  by  those  live  ranges  that  consist  of  basic  blocks  among  the  "first  half'  of 
h,  {bQ,...Jbk_x }  and  G2  induced  by  the  live  ranges  that  consist  of  basic  blocks  in  the  second  half  of 
h,  {**+i,-A»+i}-  construction  of  G}  and  G2  is  depicted  in  Figure  6-11.  Note  that  both  G; 


Figure  6-11:  Gj  and  G2 

and  G2  contain  a  part  of  the  broken  live  range  /,  b0  and  bm+}  respectively.  Because  {b0,...rhi_i) 
and  bo*  form  straight  line  code  and  no  live  range  in  G}  and  G2  contains  holes,  G; 

and  G2  must  be  interval  graphs.  Hence,  both  G}  and  G2  can  be  colored  optimally  in  polynomial 
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time.  Let  cT0P  be  the  color  used  to  color  the  part  of  /  consisting  of  b0  in  G1  and  cB0T  the  color 
used  to  color  the  part  of  /  consisting  of  bm+I  in  G2 ■  An  optimal  coloring  for  G'  can  be  derived 
from  the  colorings  for  G7  and  G2  by  exchanging  cT0P  and  cB0T  in  G2.  Implicitly,  h  can  be 
eliminated  by  padding  h  with  all  live  ranges  in  G'  that  are  colored  with  Cj0P  and  merging  the 
appropriate  live  ranges  with  /  in  G'.  After  those  merge  operations,  /  is  equivalent  to  a  continuous 
live  range. 

Part  3:  deriving  a  coloring  for  the  original  register  conflict  graph:  Deriving  a  coloring  for  the 
original  register  conflict  graph  G  is  now  straightforward.  First,  we  add  to  G'  every  live  range  v 
that  contains  bt  Because  bk  is  a  break,  v  must  overlap  with  every  live  range  in  G’,  hence  a  new 
color  must  be  used  for  v. 

This  is  depicted  in  Figure  6-12.  In  our  example,  we  must  add  the  live  range  d.  Because  all 


a  b  co 


Figure  6-12:  Adding  live  ranges  that  contain  the  break 

remaining  live  ranges  of  G  are  continuous  and  the  original  program  is  equivalent  to  straight  line 
code,  the  live  ranges  that  contain  b0  must  form  a  clique  that  separates  G'  from  the  "top  part"  of  G. 
Likewise,  the  live  ranges  that  contain  bn*l  must  form  a  clique  that  separates  G'  from  the  "bottom 
third"  of  G.  The  two  clique  separators  are  shown  in  Figure  6-13.  Hence,  the  optimality  of  the 
coloring  for  G'  is  maintained  when  an  overall  coloring  is  derived  from  individual  optimal 
colorings  of  the  top  part  of  G,  G'  and  the  bottom  part  of  G  -  q.e.d. 

Note  that  if  there  exists  a  break  in  a  straight  line  hole,  no  restrictions  must  be  met  by  live  ranges  that  are 
fits  for  the  hole.  The  prerequisites  of  Lemma  9  may  seem  strict;  note  though  that  when  register  conflict 
graphs  are  simplified  hierarchically,  programming  constructs  that  are  nested  inside  the  basic  blocks  that 
form  a  hole  will  have  been  processed  by  the  time  the  hole  is  treated.  Theiviore  our  techniques  to  collapse 
or  sequentialize  branch  clauses  and  to  eliminate  holes  nested  inside  have  already  been  applied,  with  the 
effect  that  the  enclosing  hole  can  be  viewed  as  if  it  were  a  straight  line  hole  for  the  purpose  of  register 
allocation. 
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Figure  6-13:  Clique  separators  to  remaining  conflict  graph 
63.  Node  merging  and  conditionals 

In  the  previous  chapter  we  showed  that  register  conflict  graphs  in  which  all  backarc  and  B GLOBAL  live 
ranges  are  continuous,  and  in  which  all  LOEN  and  LOEX  live  ranges  overlap  in  each  branch  sibling,  are 
interval  graphs  and  can  be  colored  optimally  in  polynomial  time. 

For  the  remainder  of  this  chapter  we  assume  that  a  conditional  branch  contains  no  conditional  broken  live 
ranges.  Even  in  the  absence  of  broken  live  ranges,  there  are  cases  of  conditionals  that  have  register  conflict 
graphs  that  are  k  colorable,  but  not  interval  graphs.  In  the  following  sections  we  examine  types  of  register 
conflict  graphs  foi  ''onditionals  that  are  not  interval  graphs  and  therefore  in  general  can  not  be  colored 
optimally  by  the  node  removal  technique.  We  discuss  how  node  merging  can  be  used  to  simplify  register 
conflict  graphs  for  conditional  branches,  and  how  in  some  cases  node  me  <>ng  can  change  a  non-interval 
register  conflict  graph  for  a  conditional  into  an  interval  graph. 

63.1.  Merging  of  Ej  related  nodes 

Figure  6-14  depicts  an  example  of  a  conditional  with  a  register  conflict  graph  for  which  the  node  removal 
technique  is  unable  to  produce  a  2-coloring.  The  conditional  branch  consists  of  the  split  and  join  node,  and 
two  branch  clauses,  {B1J)2}  and  {B3;B4}  respectively.  Vaiiable  a  is  live  at  the  split  node  and  dead  at  the 
join  node,  hence  a  is  a  LOEN  live  range  in  that  branch.  Variable  d  is  Lve  at  the  pin  node,  and  dead  at  the 
split  node,  hence  d  is  defined  inside  the  branch  clauses  and  a  LOEN  live  range.  Variables  h  and  c  are  local 
inside  the  branch  clauses  and  are  therefore  B LOCAL  live  range  i . 
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Merging  nodes  b  and  c 

Figure  6-14:  Conditional  branch  with  LOEN  and  LOEX  live  ranges  and 

conflict  graph 


The  register  conflict  graph  for  the  conditional  branch  is  depicted  to  the  right  of  the  flow  graph.  The 
register  conflict  graph  for  the  conditional  branch  is  2-colorable  and  not  recognized  as  such  by  the  node 
removal  technique.  Since  b  and  c  are  live  in  distinct  branch  clauses,  the  color  chosen  for  node  b  is 
independent  of  the  color  chosen  for  node  c  in  the  register  conflict  graph.  A  2-coloring  of  the  graph  depends 
on  the  colors  chosen  for  b  and  c  -  they  must  be  identical.  The  knowledge  that  b  and  c  live  only  in  distinct 
branch  clauses  can  be  used  to  determine  that  it  is  safe  to  color  b  and  c  with  the  same  color.  Like  in  register 
conflict  graphs  for  loops,  nodes  in  register  conflict  graph  for  conditional  branches  can  be  merged  as  long  as 
it  is  safe  to  do  so.  This  is  depicted  in  the  bottom  right  of  Figure  6-14.  The  reason  why  b  and  c  could  be 
safely  merged  into  one  single  node  is  that  they  are  both  adjacent  to  exactly  the  same  set  of  nodes  in  the 
register  conflict  graph,  a  and  cL 

Definition  10:  (The  E 1  relation)  Given  the  register  conflict  graph  G  of  a  conditional  branch  C, 
let  Cj  and  c2  be  two  live  ranges  that  are  B LOCAL  w  j.l  C.  We  say  that  c,  and  c2  are  El  related 
iff  c}  and  c2  occur  in  different  branch  clauses  and  adj(c{)=adj(cj). 

It  is  easy  to  see  that  two  BLOCAL  nodes  that  are  E1  related  r  n  be  colored  identically  and  thus  merged 
into  one  node.  The  idea  behind  merging  E}  related  nodes  is  to  re-use  the  same  color  in  different  branch 
clauses  far  BLOCAL  live  ranges  that  are  by  definition  local  to  one  branch  clause.  The  graph  G’  derived 
from  the  original  register  conflic*  graph  G  by  merging  two  Ej  related  nodes  x  and  y  is  equivalent  to  the 
register  conflict  graph  of  a  conditional  branch  in  which  all  instructions  that  use  or  define  the  variable  with 
live  range  x  (or  y)  are  eliminated. 
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Conditional  branch 

Figure  6-15:  Register  conflict  graph  of  branch  derived  from  Figure 
6-14  by  eliminating  definitions  and  uses  of  variable  c 

An  example  is  depicted  in  Figure  6-15.  The  live  ranges  for  b  and  c  are  E}  related,  and  merged  in  the 
original  register  conflict  graph  (Figure  6-14).  The  branch  depicted  in  Figure  6-15  is  identical  to  that  in 
Figure  6-14,  except  that  the  definition  of  c  in  block  B3  and  the  use  of  c  in  block  B4  have  been  removed. 
The  register  conflict  graph  is  shown  to  the  right  and  is  identical  to  the  register  conflict  graph  containing  the 
merge  node  depicted  in  Figure  6-14. 

The  observation  that  one  of  two  Ej  related  live  ranges  can  be  eliminated  both  from  the  register  conflict 
graph  and  from  the  program  without  compromising  a  ^-coloring  of  the  register  conflict  graph  is  formalized 
in  the  next  lemma. 

Lemma  11:  Given  a  register  conflict  graph  G  of  a  conditional  branch  B  that  contains  two  Ej 
related  BLOCAL  live  ranges  x  and  y,  let  G'  be  the  register  conflict  graph  derived  form  G  by 
merging  tb s  nodes  representing  x  and  y.  Then,  the  chromatic  numbers  of  G'  and  G  are  equal. 

Proof:  W.l.o.g.  assume  that  the  candidate  for  removal  is  x.  The  graph  derived  from  G  by 
eliminating  the  node  x  and  all  edges  incident  on  x  is  equal  to  G'.  Therefore  a  coloring  for  G  can 
be  directly  mapped  onto  G\  omitting  the  color  used  for  x.  Because  adj(x)=adj(y),  the  color  used 
for  y  can  be  used  for  the  merge  node  (x,y).  It  is  easy  to  see  how  a  coloring  for  G'  can  be  mapped 
onto  G:  the  color  used  for  (x,y)  is  used  for  both  x  and  y  in  G.  Therefore,  the  chromatic  numbers 
of  G  and  G'  must  be  equal. 

Note  that  eliminating  node  x  from  the  register  conflict  graph  G  in  Lemma  1 1  is  equivalent  to  removing  all 
instructions  that  define  or  use  the  variable  v  with  live  range  x  from  the  conditional.  This  is  possible 
because  x  is  a  BLOCAL  live  range,  and  does  not  extend  beyond  the  split  or  the  join  node  of  the  conditional. 


Since  two  Ej  related  nodes  are  adjacent  to  the  exactly  same  set  of  edges  in  the  register  conflict  graph,  the 
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number  of  edges  incident  on  the  merge  node  is  equal  to  the  number  of  edges  incident  to  the  individual 
nodes  that  were  merged.  For  each  other  node  in  the  original  graph,  the  number  o  r  incident  edges  is  either 
the  same  or  decreases.  In  other  words,  node  merging  of  E}  related  nodes  always  simplifies  the  register 
conflict  graphs. 

In  the  example  depicted  in  Figure  6-14,  merging  of  E}  related  nodes  turns  the  original  register  conflict 
graph  into  an  interval  graph.  The  reason  is  that  every  BLOC/iL  node  is  Ej  related  to  a  BLOCAL  node  in 
every  other  branch  clause.  As  a  consequence,  all  BLOCAL  live  ranges  can  be  merged  with  their  E }  partners 
in  one  unique  branch  clause.  In  the  example  depicted  in  Figure  6-15,  the  instructions  that  form  live  range  c 
in  the  original  branch  depicted  in  Figure  6-14  are  removed.  Now,  there  exists  only  one  branch  clause  that 
contains  BLOCAL  live  ranges  -  all  others  contain  only  LOEN,LOEX  or  BGLOBAL  live  ranges. 

The  fact  that  all  BLOCAL  nodes  can  be  eliminated  by  node  merging  does  not  necessarily  mean  that  the 
graph  derived  by  all  merge  operations  is  an  interval  graph.  This  is  depicted  in  Figure  6-16.  The  live  ranges 
of  a  conditional  branch  that  consists  of  3  straight  line  branch  clauses  are  shown.  For  readability,  we  show 
only  the  live  ranges  rather  than  the  entire  flow  graph. 

Clausa  1  Clausa  2  Clausa  3 
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Figure  6-16:  Non-interval  register  conflict  graph  after  merging  all 

BLOCAL  live  ranges 


The  branch  contains  one  LOEN  live  range  x},  two  LOEX  live  ranges  x2  and  x3  and  two  BLOCAL  live 
ranges,  a  and  b  respectively.  (Note  that  Xj,  x^  and  x3  occur  in  each  branch  clause,  all  occurrences  form  one 
single  live  range)  I  ;ve  ranges  a  and  b  are  E}  related  -  both  are  adjacent  ioxj  and  x2,  and  can  therefore  be 
merged  in  the  register  conflict  graph.  The  original  register  conflict  graph  is  depicted  to  the  bottom  left,  the 
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conflict  graph  resulting  from  merging  a  and  b  is  shown  bottom  right  Even  though  all  BLOCAL  live  ranges 
axe  merged  into  a  single  live  range,  the  resulting  conflict  graph  is  not  an  interval  graph,  because  x}  overlaps 
with  x3  in  clause  3,  but  not  in  clause  1  or  2,  where  the  BLOCAL  live  ranges  occur. 

To  ensure  an  interval  conflict  graph,  in  addition  to  requiring  that  all  BLOCAL  live  ranges  can  be  merged 
into  one  unique  branch  clause  it  is  needed  that  the  subgraph  formed  by  the  LOEN  and  LOEX  live  ranges  in 
each  individual  branch  clause  is  identical.  Figure  6-17  depicts  LOEN  live  range  Xj  and  LOEX  live  ranges 
*2  and  x3  in  different  branch  clauses.  Note  that  this  example  is  identical  to  the  example  shown  in  Figure 
6-16  with  all  BLOCAL  live  ranges  removed.  The  register  conflict  graphs  for  the  individual  branch  clauses 

Clausa  1  Clause  2  Clause  3 
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Conflict  graphs 

Figure  6-17:  Register  conflict  graphs  formed  by  LOEN  and  LOEX  live 
ranges  in  individual  branch  clauses 

are  shown  below  the  corresponding  branch  cLruse.  Note  that  the  graph  in  clause  3  differs  from  the  graphs 
in  clauses  1  and  2. 

If  the  register  conflict  graphs  formed  by  LOEN  and  LOEX  live  ranges  are  identical  in  each  branch  clause, 
merging  Ej  related  live  ranges  a  and  b  is  equal  to  the  register  conflict  graph  of  clause  1,  shown  in  Figure 
6-18.  Since  clause  1  consists  of  straight  line  code,  the  register  conflict  graph  obtained  by  node  merging  is 
an  interval  graph.  We  formalize  this  in  the  next  lemma. 

Lemma  12:  Let  G  be  the  register  conflict  graph  of  a  conditional  B  with  branch  clauses 
c2,c2,...,cB,  such  that  B  contains  no  broken  BGLOBAL  live  ranges,  the  register  conflict  graphs 
formed  by  the  LOEN  and  LOEX  live  ranges  are  identical  in  each  branch  clause,  and  for  each 
BLOCAL  live  range  be  ci  3  b'e  c-lije  (l,..„nj  such  that  b  ard  b'  are  E}  related.  Then  the 
graph  G’  derived  from  G  by  merging  all  BLOCAL  live  ranges  with  a  node  representing  a  live 
range  in  one  unique  branch  clause  is  equivalent  to  the  register  conflict  graph  of  that  branch  clause 
in  isolation. 


% 


Claus#  1  Claus*  2  Clausa  3 


V 


*1 


x3 


x  1  split  .ioda 


x3 


x3 

t 


join  noda 


Original  conflict  graph 


Figure  6-18:  Interval  register  conflict  graph  after  merging  all  BLOCAL 

live  ranges 


Proof:  By  Lemma  11,  G’  is  equal  to  the  register  conflict  graph  of  the  conditional  B  by 
removing  all  instructions  that  form  live  ranges  represented  by  nodes  that  are  merged.  Further,  the 
register  conflict  graphs  for  the  LOEN  and  LOEX  live  ranges  are  identical  in  each  individual 
branch  clause  -  hence  the  register  conflict  graph  for  the  branch  clause  to  which  all  BLOCAL  live 
ranges  have  been  merged  is  equal  to  G’  -  q.e.d. 


In  the  next  section  we  show  how  the  strict  E1  relation  can  be  relaxed  in  a  way  that  still  allows  to  use  node 
merging  to  systematically  simplify  register  conflict  graphs  of  conditional  branches. 


6 32.  Merging  of  E2  related  nodes 

The  Ej  relation  is  very  restrictive.  Nodes  that  are  El  related  must  be  BLOCAL  wj.l  a  branch  B.  By 
definition  two  £;  related  nodes  must  be  adjacent  to  the  exactly  same  set  of  nodes  -  hence,  those  nodes  can 
not  be  BLOCAL  in  the  same  conditional  C.  We  can  relax  the  E1  relation  by  allowing  adjacency  to  other 
(restricted)  BLOCAL  nodes,  but  still  require  that  the  set  of  adjacent  nodes  that  are  LOEN,  LOEX  or 
BGLOBAL  be  identical.  More  formally: 

Definition  13:  Given  a  conditional  B,  let  x  and  y  be  two  live  ranges  that  are  BLOCAL  wj.l  B, 
such  that  x  and  y  occur  in  different  branch  clauses.  We  say  that  x  and  y  are  E2  related  iff 

•  All  BLOCAL  live  ranges  adjacent  to  x  have  the  same  adjacency  lists: 

Vx' e  {adj(x)nBLOCAL(B)}  :adj(xT)-x=adj (x)  - x! 

•  All  BLOCAL  live  ranges  adjacent  to  y  have  the  same  adjacency  lists: 
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Vy'e  {ad){y)r^BLOCAL{B)} : adjty1) ~y= adj(y) - y' 

•  The  set  of  non-BLOCAL  live  ranges  adjacent  to  x  is  equal  to  the  set  of  non -BLOCAL  live 
ranges  adjacent  to  y: 

adj(x)rG = adj(y)nG,  where,  G=LOEN(B)uLOEX(B)kjBGLOBAL(B ) 

Note  that  Ej  related  nodes  are  also  E2  related.  Figure  6-19  shows  the  live  ranges  in  the  branch  clauses  of 
a  conditional  branch,  consisting  of  LOEN  live  range  x},  LOEX  live  range  Xj  and  BLOCAL  live  ranges 
a, a!  a’  '  Jb  and  b’ .  Live  ranges  a  a’  and  a"  are  E2  related  to  live  ranges  b  and  b’ . 

Clausa  1  Clause  2 

split  node 


join  node 


Figure  6-19:  E2  related  nodes 


If  in  a  conditional  branch  all  BLOCAL  live  ranges  of  one  branch  clause  are  E2  related  to  a  BLOCAL  live 
range  in  every  other  branch  clause,  we  can  show  by  arguments  similar  to  those  used  in  Lemma  12  that 
merging  all  E2  related  nodes  results  in  a  graph  that  is  equivalent  to  the  register  conflict  graph  of  one  single 
branch  clause  -  given  that  the  register  conflict  .graphs  formed  by  LOEN  and  LOEX  live  ranges  are  identical 
in  each  individual  branch  clause. 


Things  are  a  bit  more  complicated,  because  Lemma  1 1  does  not  hold  for  two  E2  related  nodes.  This  is 
demonstrated  in  Figure  6-20.  Nodes  a  and  b  of  tlx  original  conditional  shown  in  Figure  6-19  are  merged 
into  one  single  node,  and  the  resulting  graph  is  depicted  to  the  bottom.  Unlike  for  E 1  related  nodes,  the 
graph  obtained  by  merging  a  and  b  is  not  equivalent  to  the  register  conflict  graph  of  the  conditional  in 
which  live  range  b  has  been  removed.  The  conditional  branch  in  which  live  range  b  has  been  removed  is 
shown  top  left  (Figure  6-20).  The  register  conflict  graph  for  that  conditional  branch  is  depicted  top  right, 
and  differs  fiom  the  graph  at  the  bottom. 
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Removing  live  range  b:  branch  and  conflict  graph 


Merging  a  and  b  in  original  register  conflict  graph 

Figure  6-20:  Difference  between  removing  live  range  b  and  merging  it  with  a  in  original  conflict  graph 

The  reason  why  Lemma  1 1  does  not  work  for  E2  related  nodes  is  that  E2  related  nodes  can  be  adjacent  to 
other  B LOCAL  live  ranges  within  the  same  branch  clause.  Merging  two  E2  related  nodes  n1  and  n2  causes 
(by  symmetry)  n2  to  move  to  a  different  branch  clause.  Other  BLOCAL  live  ranges  adjacent  to  n2  are  now 
adjacent  to  the  merged  node  consisting  of  and  therefore  an  edge  is  introduced  between  BLOCAL 

nodes  of  different  branch  clauses.  In  the  example  depicted  in  Figure  6-20,  the  node  that  contains  a  and  b  is 
adjacent  to  the  node  b’  -  the  merge  node  "belongs"  to  clause  1  and  b’  belongs  to  clause  2.  If  however  all 
BLOCAL  live  ranges  in  the  clique  formed  by  E2  related  nodes  are  merged,  „ae  resulting  graph  is  equivalent 
to  the  conflict  graph  of  a  branch  in  which  all  members  of  the  clique  have  been  removed,  shown  in  Figure 
6-21.  The  conditional  branch  in  which  both  b  and  b’  have  been  removed  is  shown  to  the  top  left,  and  the 
corresponding  register  conflict  graph  is  shown  to  the  right.  This  conflict  graph  is  identical  to  the  graph 
obtained  from  the  original  register  conflict  graph  by  merging  all  BLOCAL  members  of  the  clique  that 
contains  b  in  clause  2  with  their  E2  partners  of  clause  1. 

In  the  example  we  used  in  Figure  6-21,  nodes  b  and  b'  were  merged  with  their  E2  partners  in  clause  1. 
Since  the  number  of  BLOCAL  live  ranges  adjacent  to  a  is  larger  than  the  number  of  BLOCAL  live  ranges 
adjacent  to  b,  there  were  enough  nodes  in  clause  1  such  that  each  BLOCAL  node  adjacent  to  b  could  be 
merged.  This  is  not  the  case  if  we  wished  to  merge  the  other  way  around:  merging  three  live  ranges  a, a’ 
and  a”  with  two  live  ranges  b  and  b'  does  not  work  -  and  the  trick  by  mapping  the  resulting  graph  to  the 
register  conflict  graph  of  one  branch  clause  does  not  work  either.  This  is  <iepicted  in  Figure  6-22. 
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Clause  1  Clause  2 


Re>  xxv  .ifl  live  ranges  u  and  b1:  register  conflict  graph 


Merging  a,b  and  a".b'  in  original  register  conflict  graph 

Figure  6-21:  Merging  all  clique  members  of  one  branch  clause 

The  conditional  branch  in  which  live  ranges  a  and  a'  have  been  removed  is  shown  to  the  top  left,  and  the 
corresponding  register  conflict  graph  to  the  top  right.  Note  that  the  register  conflict  graph  differs  from  the 
graph  derived  by  merging  a  and  b  and  a'  and  b’  -  the  adjacency  of  a' '  with  the  merge  nodes  is  lost  if  live 
ranges  a  and  a’  are  simply  removed.  We  can  avoid  this  problem  by  adding  a  dummy  live  range  d  that 
consists  of  the  same  basic  blocks  as  b.  Note  that  adding  the  dummy  live  range  d  to  the  register  conflict 
graph  is  equivalent  to  duplicating  every  definition  and  use  of  the  variable  with  live  ran^c  b  by  the  definition 
or  use  of  a  dummy  variable.  Hence,  we  have  constructed  a  new  conditional  B'  such  that  the  register 
conflict  graph  for  B’  is  identical  to  the  graph  obtained  from  the  original  register  conflict  graph  by  merging 
a  and  a’  with  b  and  b’  respectively.  This  is  shown  in  Figure  6-23. 

Lemma  14:  Given  a  conditional  C  with  conflict  graph  G  that  consists  of  branch  clauses  cv...£n 
such  that  the  graph  induced  by  the  conflict  graph  of  an  individual  branch  clause  c;  and  the  LOEN 
and LOEX  live  ranges  is  identical  for  each  c,  e  {cj,...,cn},  let  v  and  v’  be  two  BLOCAL  live  ranges 
of  different  branch  clauses  ci  and  Cj  such  that  v  and  v’  are  E2  related.  The  chromatic  numbers  of 
G’  derived  from  G  by  merging  v  and  v’  are  identical. 

Proof:  Clearly,  a  valid  coloring  for  G’  can  be  easily  extended  to  a  valid  coloring  for  G  by 
directly  mapping  the  colors  of  all  nodes  to  the  same  nodes  in  G  and  by  mapping  the  color  of  the 
merged  node  (v.vO  to  both  v  and  v’.  Given  a  coloring  for  G,  we  obtain  a  coloring  for  G'  by 
enforcing  the  same  color  for  v  and  v\  Let  colv  and  colv.  be  the  colors  used  for  v  and  v’ 
respectively.  Because  every  LOEN ,  LOEX  or  BGLOBAL  live  range  that  is  adjacent  to  v  is  also 
adjacent  to  v’,  no  global  live  range  is  colored  with  colv  or  colv..  Let  x  be  a  BLOCAL  live  range 
adjacent  to  v.  By  definition  of  the  E2  relation,  all  BLOCAL  live  ranges  adjacent  to  v  have 
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Figure  6-22:  Merging  larger  clique  of  one  branch  clause  with  smaller  clique  of 

other  branch  clause 
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Figure  6-23:  Merging  larger  clique  of  one  branch  clause  with  smaller  clique  of 

other  branch  clause 
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identical  adjacency  sets.  Hence,  the  colors  colv  and  co/v.  can  be  exchanged  in  the  live  ranges  that 
are  adjacent  to  v  -  no  new  colors  are  needed. 

Note  that  if  all  B LOCAL  live  ranges  can  be  merged  with  BLOCAL  live  ranges  of  one  unique  branch 
clause,  the  derived  register  conflict  graph  is  equivalent  to  the  register  conflict  graph  of  just  one  branch 
clause.  Consequently,  E1  and  E2  related  merging  can  be  used  to  collapse  several  branc*  cl  auses  into  one  - 
in  the  case  of  non-nested  conditionals,  this  leads  to  straight  line  code  and  hence  to  at.  mterval  register 
conflict  graph. 

6.4.  Chapter  summary 

We  have  seen  that  node  merging  can  be  applied  to  broken  live  ranges  and  to  BLOCAL  live  ranges  of 
conditionals.  In  some  cases  it  is  even  possible  to  change  a  non-interval  register  conflict  graph  into  an 
interval  graph  via  node  merging.  Because  the  node  removal  technique  is  in  general  unable  to  produce  a  k 
coloring  for  ^-colorable  non-interval  graph,  there  are  cases  of  register  conflict  graphs  for  which  the  node 
removal  introduces  spill  code  that  can  be  avoided  altogether  by  node  merging.  Even  in  cases  where  it  is 
not  possible  to  eliminate  all  non-chordal  cycles  in  the  register  conflict  graph,  node  merging  always  results 
in  a  simpler  register  conflict  graph. 

The  advantages  of  encoding  structural  knowledge  in  register  conflict  graphs  are  twofold.  First,  the 
performance  of  the  node  removal  technique  can  be  provably  enhanced  by  node  merging.  Second,  structural 
analysis  permits  to  partition  the  register  conflict  graph  into  disjoint  subgraphs  that  can  be  colored 
individually  and  the  colorings  of  the  individual  parts  can  be  joined  to  an  overall  coloring  by  renaming. 
Both  aspects  of  structural  knowledge  in  register  conflict  graphs  are  the  focus  of  an  experimental  evaluation 
of  our  model  for  register  conflict  graphs,  and  is  discussed  in  the  next  chapter. 
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Chapter  7 

Structured  Global  Register  Allocation  in  Practice: 
Development  of  a  Parallel  Framework 


Developing  a  parallel  framework  for  global  register  allocation  required  the  analysis  of  register  conflict 
graph.  We  have  shown  that  this  is  possible  within  our  structured  model  described  in  the  previous  chapters. 
Initially  motivated  by  paralidization,  we  find  that  our  model  and  our  analytical  methods  lead  to  a  more 
thorough  understanding  of  sequential  global  register  allocation;  once  the  location  of  clique  separators  is 
known,  the  paralidization  of  global  register  allocation  comes  naturally,  almost  as  a  "byproduct"  of  our 
analysis.  The  success  of  the  parallelization  depends  on  the  knowledge  collected  by  our  method.  In  this 
chapter,  we  therefore  concentrate  on  the  evaluation  of  our  method. 

Our  method  is  best  described  as  analysis  of  register  conflict  graphs  prior  to  the  assignment  of  specific 
registers  to  live  ranges  in  the  conflict  graph.  The  purpose  of  this  analysis  is  to  map  a  register  conflict  graph 
to  an  equivalent  interval  graph  and,  if  this  is  not  possible,  to  identify  which  portions  of  a  register  conflict 
graph  are  "responsible"  for  an  overall  non-interval  register  conflict  graph.  Mapping  a  register  conflict 
graph  to  an  equivalent  interval  graph  is  done  via  mapping  a  well  structured  flow  graph  to  a  straight  line 
flow  graph  that  is  equivalent  for  the  purpose  of  global  register  allocation.  By  Theorem  1 1  of  Chapter  5,  the 
straight  line  sequence  of  basic  blocks  can  be  used  to  enumerate  all  cliques  in  the  register  conflict  graph  in 
the  .<.  order  described  in  Chapter  5. 

Interval  register  conflict  graphs  are  desirable  for  three  main  reasons.  First,  in  an  interval  graph  all  clique 
separators  can  be  found  in  polynomial  time.  Second,  interval  graphs  can  be  colored  optimally  in 
polynomial  time.  Third,  by  enumerating  all  cliques  of  an  interval  graph  in  the  .<.  order  it  is  possible  to 
identify  regions  of  the  conflict  graph  that  require  spilling:  given  k  registers,  live  ranges  that  occur  in  cliques 
that  are  larger  than  k  must  be  considered  for  spilling  to  memory  or  for  splitting.  The  important  fact  is  that 
this  is  known  before  the  actual  coloring. 

The  success  of  our  method  depends  on  how  often  our  transformations  of  flow  graphs  and  conflict  graphs 
are  applicable  to  real  programs  and  on  the  number  of  separator  cliques  that  can  be  located.  If  large  portions 
of  a  register  conflict  graph  can  be  mapped  to  equivalent  interval  graphs,  we  can  locate  a  large  number  of 
separator  cliques.  Knowledge  about  separator  cliques  is  the  basis  for  parallelizing  global  register  allocation 
but  is  particularly  well  suited  to  improve  sequential  register  allocation. 

Given  the  knowledge  about  the  location  of  clique  separators,  the  register  conflict  graph  can  be  partitioned 
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into  clique  connected  components  that  are  either  interval  graphs  or  non-interval  graphs.  In  that  partition, 
the  non-interval  graphs  are  formed  by  live  ranges  that  occur  in  conditionals  that  could  not  be  collapsed  or 
linearized,  or  by  live  ranges  that  contain  basic  blocks  that  occur  in  a  hole  of  a  broken  live  range  that  could 
not  be  eliminated.  In  other  words,  our  structured  method  enables  us  to  concentrate  on  small  portions  of  the 
register  conflict  graph  that  are  "responsible"  for  an  overall  non-interval  register  conflict  graph.  Because 
such  hot  spots  in  a  register  conflict  graph  typically  consist  only  of  a  small  subset  of  the  live  ranges, 
expensive  heuristics  cm-  even  exhaustive  search  can  be  used  to  find  an  overall  good  or  optimal  coloring. 
Isolating  regions  of  the  register  conflict  graph  in  which  spilling  is  necessary  is  useful  because  more 
knowledge  can  be  used  in  spilling  heuristics.  Further,  the  register  conflict  graph  can  be  partitioned 
hierarchically  via  clique  separators.  Clique  separated  critical  portions  of  the  register  conflict  graph  can  be 
colored  independently  and  combined  with  the  remaining  conflict  graph  without  compromising  optimality 
of  the  individual  coloring. 

To  evaluate  the  performance  of  our  method  for  analyzing  and  partitioning  register  conflict  graphs  for  well 
structured  programs,  we  implemented  our  techniques  and  measured  its  performance  on  a  set  of  t  enchmark 
kernels.  In  this  chapter,  we  present  our  implementation  and  report  the  results  of  our  evaluation. 

7.1.  Evaluation  of  our  method 

Our  method  pre-processes  the  register  conflict  graph  so  that  the  subsequent  register  coloring  can  use  the 
knowledge  gathered  during  the  analysis  and  operates  on  a  simplified  register  conflict  graph.  To  evaluate 
our  method  for  practical  purposes,  we  have  to  assess  how  applicable  our  flow  graph  and  register  conflict 
graph  transformations  are  for  real  flow  graphs.  The  transformations  of  our  method  achieve  three  goals: 

•  The  simplification  of  a  structured  program  into  a  piece  of  straight  line  code  that  is  equivalent 
for  the  purpose  of  register  allocation 

•  The  simplification  of  a  non-interval  register  conflict  graph  to  an  equivalent  interval  graph 

•  The  detection  of  separator  cliques  for  the  parallelization  of  global  register  allocation 

Transformations  of  flow  graphs  and  conflict  graphs: 

In  the  previous  chapters  we  have  developed  a  set  of  restrictions  that  must  be  met  by  the  live  ranges  of  a 
program  if  our  simplifications  can  be  carried  out  without  compromising  optimality  of  global  register 
allocation.  The  first  goal  of  an  experimental  evaluation  of  our  method  is  to  determine  how  often  this  set  of 
restrictions  is  met  by  real  programs.  For  programs  that  can  not  be  reduced  to  straight  line  code  for  the 
purpose  of  register  allocation,  we  determine  the  size  of  the  portions  that  could  not  be  reduced  to  straight 
line  code. 

Parallelization: 

Given  the  results  of  our  graph  transformations,  we  can  identify  separator  cliques  in  the  register  conflict 
graph.  We  demonstrate  the  potential  for  parallelization  by  measuring  the  size  of  the  ordered  sequences  of 
separator  cliques  in  the  register  conflict  graph  of  each  benchmark  kernel.  We  discuss  the  tradeoffs  between 
choosing  small  separator  cliques  and  creating  parallel  tasks  of  even  sizes. 
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7.1.1.  Implementation 

Registers  are  a  scarce  resource  on  every  computer  architecture  [Pat/Hen  90],  The  need  of  registers  is 
further  increased  by  global  optimizations  that  introduce  temporaries  that  have  to  reside  in  registers. 
Conflict  graphs  for  globally  optimized  programs  are  in  general  much  more  complicated  than  conflict 
graphs  produced  by  compilers  that  do  not  employ  global  optimizations.  For  this  reason  we  chose  as  input 
to  our  program  intermediate  code  produced  by  an  optimizing  compiler.  The  C  compiler  for  the  iwarp 
machine  fits  this  category  and  was  used  to  produce  the  input  to  our  conflict  graph  analyzer.  The  goal  of  our 
implementation  was  to  evaluate  two  aspects  of  the  method.  First,  to  determine  if  the  restrictions  that  permit 
an  interval  register  conflict  graph  are  met  by  realistic  user  programs,  and  second  to  assess  the  effectiveness 
of  a  parallelization  of  global  register  allocation  based  on  clique  separators. 

The  gist  of  our  method  is  the  analysis  of  register  conflict  graph,  and  our  techniques  aim  for  the 
Amplification  of  register  conflict  graphs.  To  assess  how  often  such  simplifications  could  be  carried  out  on 
the  flow-  and  conflict  graphs  of  realistic  programs,  it  was  sufficient  to  run  our  analysis  "off  line",  and  we 
did  not  incorporate  our  register  allocator  into  the  backend  of  the  rwAR?  compiler. 

Our  implementation  consists  of  3  phases.  Phase  1  is  the  "setup"  phase,  and  consists  of  extracting  the 
parse  tree,  the  register  conflict  graph  and  the  live  ranges  of  variables  from  the  input  program  in 
intermediate  form.  The  analysis  and  simplification  of  the  register  conflict  graph  are  performed  during 
Phase  2.  Phase  3  conducts  the  data  partitioning  for  the  parallelization  of  global  register  allocation.  This 
includes  determining  the  clique  separators  and  the  computation  of  the  accumulated  live  ranges  per  basic 
block.  Figure  7-1  gives  an  overview  of  the  parts  of  our  implementation.  Because  they  form  the  gist  of  our 
implementation,  we  describe  Phase  2  and  Phase  3  in  more  detail  in  the  next  paragraphs. 

7.1.2.  Simplifying  the  conflict  graph 

The  input  to  Phase  2  of  the  algorithm  is  the  parse  tree  of  a  program,  along  with  the  live  ranges  and  the 
register  conflict  graph.  Note  that  each  loop  or  conditional  in  the  flow  graph  is  represented  by  a  node  in  the 
parse  tree.  Innermost  constructs  are  the  leaves  of  the  tree,  while  the  successors  of  the  root  of  the  parse  tree 
represent  the  outermost  programming  constructs.  The  purpose  of  Phase  2  is  to  determine  which  basic 
blocks  belong  to  conditionals  or  loops  that  could  not  be  simplified  to  straight  line  code,  called  prohibited 
blocks.  The  nodes  in  the  parse  tree  are  processed  in  post  order.  The  restrictions  that  are  tested  in  our 
implementation  were  chosen  such  that  every  ode  in  the  parse  tree  must  be  processed  exactly  once.  Hence 
our  algorithm  does  not  involve  backtracking  and  runs  in  polynomial  time.  The  detailed  algorithm  for  Phase 
2  is  given  in  Figure  7-2. 

At  the  end  of  Phase  2,  every  basic  block  that  is  not  among  prohibited  blocks  is  part  of  straight  line  code, 
sometimes  derived  by  removing  backarcs  from  loops  and  by  the  sequentLoization  or  the  collapsing  of 
conditionals.  Hence,  all  live  ranges  that  contain  a  block  not  in  prohibited  blocks  form  a  clique  that 
separates  the  conflict  graph  into  two  disjoint  pieces.  Phase  3  consists  oi  computing  the  separator  cliques 
for  all  but  the  prohibited  basic  blocks  and  the  partitioning  of  the  flow  graph  into  clique  separated 
components.  The  actions  of  Phase  3  are  outlined  in  the  algorithm  given  in  Figure  7-3. 
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iWARP  compiler 
front  end  and 
global  optimizer 


Phase  1 


Phase  2 


Phase  3 


Figure  7-1:  Phases  erf  the  structured  global  register  allocator 


In  the  evaluation  of  our  method,  we  first  concentrate  on  the  effects  of  Phase  2,  and  analyze  the 
parallelization  in  Phase  3  separately. 


72.  Input  data 

We  measured  the  performance  of  our  method  for  33  benchmark  kernels.  The  benchmark  for  our 
evaluation  consists  of  the  following  functions: 

1.  The  Livermore  kernels  [McMahon  86] 

2.  The  larger  examples  of  the  Numerical  Recipe*  collection  [Press  et  al  88] 

3.  A  collection  of  programs  from  the  wart  erleau,  et  al.  88,  Kung  88] 

Because  our  method  requires  well  structure  we  had  to  modify  the  original  code  slightly  in 

some  cases.  Both  the  Livermore  kernels  (original  implementation  language:  fortran)  and  some  warp 
applications  (original  implementation  language:  W2)  woe  translated  to  C,  all  other  functions  of  the  testsuite 
were  readily  available  in  C.  It  should  be  noted  that  the  programming  style  of  the  Numerical  Recipes  in  C  is 
similar  to  FORTRAN  coding  style.  We  included  a  graph  manipulation  problem  ( blocks )  that  was  very 
carefully  handoptimized  in  C  pointers  and  address  arithmetic  were  used  excessively  and  the  programmer 
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Input:  Parse  tree,  live  ranges  and  conflict  graph  of  a  well  structured  program 

Output:  prohibited  blocks,  a  set  of  basic  blocks  that  may  not  be  used  for  the  computation  of 

clique  separators  and  a  simplified  conflict  graph 

Method: 

prohibited  blocks  :=  0; 

for  all  nodes  n  in  the  parse  tree  in  post  order  do  { 

if  n  is  a  conditional  { 

compute  WEN,  WEX,  BGLOBAL,  BWCAL 
determine  conditional  continuous  live  ranges; 
eliminate  holes  from  conditional  broken  live  ranges; 
prohibited  blocks  +=  uneliminated  holes; 

if  conditional  can  not  be  linearized  or  collapsed  { 

prohibited  blocks  +=  blocks  that  form  conditional; 

} 

}  else  { 

determine  loop  continuous  live  ranges; 
eliminate  holes  from  loop  broken  live  ranges; 
prohibited  blocks  +=  uneliminated  holes; 

} 

} 


Figure  7-2:  Outline  of  Phase  2 


Input:  Simplified  flaw  graph  and  prohibited  blocks 

for  each  basic  block  b  s(b),  the  set  of  forward  successors  of  b 

Output:  A  partitioning  of  the  register  conflict  graph 

Method: 

for  all  basic  blocks  b  t  {prohibited  blocks}  do  { 
separatorQs) = <o(IN(s))s  e  j(b); 


} 

partition  graph  based  on  separator  cliques  and  partial  graph  sizes; 


Figure  7-3:  Outline  of  Phase  3 

determined  which  variables  were  assigned  to  registers.  We  included  this  function  to  assess  the  shape  of 
register  conflict  graphs  of  applications  that  are  programmed  in  "typical"  C  style. 


We  chose  the  Livermore  loops  because  they  represent  a  standard  benchmark  of  kernels  that  are  used  in 
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many  scientific  applications  [Pat/Hen  90].  While  the  Livermore  kernels  consist  only  of  a  collection  of 
small  kernels,  we  chose  them  because  they  are  used  by  a  large  number  of  scientific  applications  The  same 
is  true  for  the  Numerical  Recipes  collection,  and  our  selection  of  kernels  from  the  Numerical  Recipes 
collection  was  guided  by  their  complexity:  we  chose  the  largest  procedures  from  that  collection  of 
programs. 

The  set  of  applications  from  the  WARP  libra  y  consist  of  applications  that  consist  of  "real"  programs 
instead  of  just  kernels,  and  are  (or  were)  in  ust  on  the  warp  and  IWARP  machines.  Both  the  Livermore 
loops  and  the  examples  from  the  v.  RP  library  represent  scientific  programs  that  are  well  suited  for  a 
globally  optimizing  compiler.  Our  graph  manipulation  program  was  chosen  as  an  example  of  a  non- 
numencal  program,  and  in  addition  was  carefully  hand  optimized.  Such  programs  are  usually  hard  to 
optimize  by  a  compiler. 

The  histograms  depicted  in  Figures  74  through  7-12  summarize  some  characteristics  of  our  benchmark 
kernels.  Figures  74,  7-5  and  7-6  depict  the  number  of  basic  blocks,  the  number  of  live  ranges  and  the 
nesting  depth  of  the  flow  graph  for  the  Livermore  kernels.  The  same  information  for  the  numerical  recipes 
is  shown  in  Figures  7-7, 7-8  and  7-9,  and  for  the  warp  examples  in  ”igures  7-10, 7-11  and  7-12. 


For  most  benchmark:  functions,  the  number  of  live  ranges  is  higher  than  the  number  of  basic  blocks.  One 
example  for  which  the  number  of  live  ranges  is  significantly  smaller  than  the  number  of  basic  blocks  is 
block,  a  graph  manipulation  program,  very  carefully  hand  optimized  in  C.  The  program  contains  35  global 
variables,  while  the  total  number  of  live  ranges  is  only  46.  Because  the  number  of  basic  blocks  in  that 
program  is  so  much  larger  (126),  it  can  be  concluded  that  the  "lack"  of  live  ranges  is  due  to  the  absence  of 
temporaries  after  global  optimizations.  This  is  not  surprising  because  hand  optimized  C  programs  that 
contain  many  pointer’s  make  it  hard  for  a  global  optimizer  to  identify  common  subexpressions. 
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Figure  7-5:  Livermore  loops:  number  of  live  ranges 


Figure  7-6:  Livermore  loops:  nesting  depth 


7 3.  Mapping  register  conflict  graphs  to  interval  graphs:  evaluating  Phase  2 

Phase  2  of  our  method  consists  of  applying  transformations  to  a  flow  graph  and  a  register  conflict  graph 
such  that  the  original  register  conflict  graph  can  be  mapped  to  an  equivalent  interval  graph.  Mapping  a 
register  conflict  graph  to  an  equivalent  interval  register  conflict  graph  can  only  be  achieved  if  the  following 
conditions  are  met: 

1.  All  holes  in  broken  live  ranges  can  be  padded, 

2.  All  conditionals  can  be  either  collapsed  or  linearized. 

We  showed  in  Chapter  6  that  we  are  able  to  pad  holes  optimally  in  polynomial  time  if  they  contain  breaks. 
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Figure  7-8:  Numerical  recipes:  number  of  live  ranges 


or  if  holes  can  be  padded  entirely  with  perfect  or  imperfect  matches.  As  a  "reward"  for  successful  node 
merging  of  all  broken  live  ranges  in  a  loop,  the  backarc  of  that  loop  can  be  eliminated  from  the  flow  graph 
without  affecting  the  register  conflict  graph.  Similarly,  conditionals  with  only  one  nonempty  branch  clause 
can  be  simplified  by  removing  the  link  from  the  split  node  to  the  join  node  if  all  holes  that  occur  in  the 
conditional  can  be  eliminated.  Therefore  the  simplification  of  conditionals  with  only  one  nonempty  branch 
clause  is  equivalent  to  successful  hole  elimination. 

Conditionals  with  multiple  nonempty  branch  clauses  -  unlike  those  with  only  one  nonempty  branch  clause 
-  can  cause  non-interval  register  conflict  graphs  even  if  all  live  ranges  are  continuous.  Thus,  in  our 
evaluation  of  the  simplifications  of  conditionals,  only  those  that  contain  at  least  two  nonempty  branch 
clauses  must  be  considered. 
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Figure  7-9:  Numerical  recipes:  nesting  depth 


btekprop  filming  »elect_b*ndi  block 

Figure  7-10:  warp  examples:  basic  blocks 


73.1.  Holes  and  node  merging  in  conflict  graphs 

Holes  and  node  merging  to  pad  holes  is  the  focus  of  our  first  set  of  measurements.  For  each  program,  we 
counted  the  total  number  of  holes  in  all  live  ranges  and  the  successful  attempts  to  pad  those  holes  under  one 
of  our  restrictions.  The  results  for  our  benchmark  kernels  whose  conflict  graphs  contained  broken  live 
ranges  are  summarized  in  Table  7-1. 

The  column  labeled  holes  contains  the  total  number  of  holes  of  all  live  ranges  in  the  register  conflict 
graph.  The  column  labeled  padded  contains  the  number  of  holes  that  could  be  eliminated  without  node 
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Figure  7-1.1:  WARP  examples:  number  of  live  ranges 


btckprop  filia-mg  selecljxjuk  block 


Figure  7-12:  warp  examples:  nesting  depth 

Holes  Padded  Merged  Remaining 


K20 

1 

1 

- 

none 

ludcmp 

1 

1 

- 

none 

svdcmp 

20 

9 

9 

2 

baclcprop 

2 

2 

- 

none 

block 

2 

2 

- 

none 

Table  7-1:  Node  merging  to  eliminate  holes 
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merging  because  no  fits  existed  for  these  holes  and  the  column  labeled  merged  depicts  the  number  of  holes 
that  were  eliminated  via  node  merging.  The  last  column  labeled  emaining  contains  the  number  of  holes 
that  could  not  be  eliminated  by  our  method. 

7 3 2.  Discussion  of  the  results  for  hole  elimination 

Of  our  33  benchmark  kernels,  5  had  register  conflict  graphs  with  broken  live  ranges.  Of  the  Livermore 
loops,  only  the  register  conflict  graph  of  Kernel  20  contained  a  broken  live  range  with  one  single  hole  that 
could  be  padded  without  node  merging.  Among  the  other  programs,  a  few  examples  of  the  Numerical 
Recipes  collection  and  of  the  warp  library  contained  broken  live  ranges.  For  all  functions  except  svdemp 
(single  value  decomposition),  the  holes  could  be  padded  without  node  merging.  Single  value 
decomposition  is  a  fairly  complicated  example  of  the  Numerical  Recipe  collection.  The  number  of  broken 
live  ranges  is  11,  and  the  total  number  of  holes  is  20,  9  of  which  coild  be  padded  without  node  merging, 
and  9  of  which  could  be  eliminated  via  node  merging.  The  large  number  of  broken  live  ranges  in  svdemp  is 
caused  by  frequent  re-definitions  of  variables  that  occur  in  conditionals.  Our  results  indicate  that  for  small 
kernels  such  as  the  Livermore  loops,  broken  live  ranges  do  not  pose  great  problems.  More  complex 
programs  such  as  the  warp  and  Numerical  Recipes  examples  can  contain  holes,  and  in  most  cases  our 
methods  to  pad  holes  with  or  without  node  merging  are  successful.  In  our  examples,  only  two  holes  could 
not  be  padded  with  our  method,  and  both  occurred  in  svdemp. 

7 33.  Sequentializing  and  collapsing  conditionals 

Our  second  set  of  measurements  was  conducted  to  examine  how  frequently  conditionals  could  be 
linearized  or  collapsed.  It  was  shown  earlier  that  conditionals  that  consist  of  only  one  nonempty  branch 
clause  can  be  treated  like  straight  line  code  provided  all  live  ranges  that  occur  in  such  conditionals  are 
continuous.  Hence,  successful  hole  elimination  is  sufficient  to  find  equivalent  straight  line  code  for  the 
purpose  of  register  allocation  for  such  conditionals.  Only  conditionals  with  multiple  nonempty  branch 
clauses  must  be  analyzed  for  simplification  by  sequentializing  or  collapsing. 

The  histograms  shown  in  Figures  7-13,  7-14  and  7-15  summarize  our  results  for  sequentializing  and 
collapsing  conditionals.  Each  benchmark  kernel  that  contained  conditionals  with  at  least  2  non-empty 
branch  clauses  is  described  by  two  bars.  The  left  bar  (shaded  in  grey)  depicts  the  number  of  conditionals 
with  at  least '  nonempty  branch  clauses.  The  right  bar  consists  of  a  bottom  and  a  top  part 

The  bottom  part  depicts  the  number  of  conditionals  that  could  be  linearized  because  they  contained  only 
BLOCAL  live  ranges  or  because  BLOCAL  live  ranges  were  absent  and  all  LOEN  and  LOEX  live  ranges 
formed  cliques. 

The  top  part  depicts  the  number  of  conditionals  that  could  be  collapsed  into  just  one  path  through  the 
conditional  because  all  BLOCAL  live  ranges  of  all  branch  clauses  were  £;  or  E2  related  to  BLOCAL  live 
ranges  in  a  different  branch  clause.  Conditional  tic*  are  collapsed  must  contain  a  combination  of  LOEN 
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and/or  LOEX  live  ranges  with  B LOCAL  live  ranges.  If  only  one  branch  clause  contains  BLOCAL  live 
ranges,  no  node  merging  takes  place.  If  both  branch  clauses  contain  BLOCAL  live  ranges,  the  conditional 
can  only  be  collapsed  if  the  BLOCAL  live  ranges  of  both  clauses  are  El  or  E2  related  and  have  been  merged 
in  the  conflict  graph.  E2  related  merging  occurred  in  two  of  our  examples:  Livermore  kernels  15  (2  merge 
operations)  and  17  (1  merge  operation).  In  all  other  cases,  only  one  of  the  branch  clauses  contained 
BLOCAL  live  ranges  and  no  node  merging  took  place  in  the  conflict  graph. 


K1S  K1I  K17  K20  K22 

Figure  7-13:  Simplification  of  conditionals  in  Livermore  loops 


Figure  7-14:  Simplification  of  conditionals  in  Numerical  Recipes  examples 
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Figure  7-15:  Simplification  of  conditionals  in  warp  and  graph  programs 

7  J.4.  Discussion  of  the  results  for  conditional  simplification 

Ever)’  conditional  in  the  programs  of  our  testsuite  could  be  linearized  or  collapsed.  The  number  of 
linearized  conditionals  is  much  greater  than  the  number  of  collapsed  conditionals.  In  most  cases,  this  was 
due  to  the  absence  of  BLOCAL  live  ranges.  For  those  conditionals  that  were  collapsed,  node  merging 
occurred  only  in  two  cases  (Livermore  kernels  15  and  17).  The  number  of  merge  operations  was  small 
compared  to  the  total  number  of  live  ranges  in  these  programs.  Interestingly,  outer  conditionals  whose 
branch  clauses  consist  of  complicated  nested  structures  were  always  sequentializable.  Typically,  large 
conditionals  that  are  sequentializable  cot  tain  no  LOEN  or  LOEX  live  ranges,  so  the  BLOCAL  live  ranges  in 
each  branch  clause  are  independent.  In  both  cases  that  required  node  merging  to  collapse  conditionals,  one 
branch  clause  consisted  of  a  long  straight  line  structure  while  the  second  branch  clause  consisted  of  a  short 
piece  of  straight  line  code  consisting  of  only  one  basic  block. 

13JS.  Combining  the  results  for  graph  transformations 

We  conclude  the  first  part  of  our  evaluation  by  summarizing  the  overall  result  of  our  transformations.  Of 
the  33  functions  in  our  benchmark,  24  had  interval  conflict  graphs  and  no  node  in  those  conflict  graphs  was 
altered  (by  adding  to  the  set  of  basic  blocks  that  form  a  live  range)  or  merged.  The  conflict  graphs  >>f  9 
benchmark  kernels  were  transformed  such  that  the  resulting  conflict  graph  was  an  interval  graph.  Such 
transformations  included  padding  of  holes  and  eliminating  basic  blocks  from  live  ranges  because  branch 
clauses  disappeared  due  to  collapsing.  There  was  only  one  function,  svdcmp ,  whose  conflict  graph  could 
not  be  transformed  into  an  interval  graph  by  our  technique.  The  conflict  graph  contained  2  hot  spots,  one 
consisting  of  4  and  one  of  13  live  ranges.  The  remaining  portions  of  the  conflict  graph  (total  number  of 
original  nodes:  205)  were  detected  as  interval  graphs  or  transformed  into  interval  graphs.  The  chromauc 
numbers  of  the  conflict  graphs  of  our  benchmark  kernels  are  summarized  in  Figures  7-16, 7-17  and  7-18 
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Figure  7-16:  Chromatic  numbers:  Livermore  loops 


Figure  7-17:  Chromatic  numbers:  Numerical  Recipes 


7.3.6.  Discussion  of  graph  transformation  results 

We  have  seen  that  in  all  but  one  cases  the  register  conflict  graphs  of  our  programs  could  be  simplified  to 
interval  register  conflict  graphs.  The  chromatic  numbers  of  the  graphs  were  largest  for  the  more 
complicated  examples  in  our  testsuite.  Single  value  decomposition  ( svdcmp )  was  the  only  program  that 
had  "hot  spots"  in  the  register  conflict  graph,  that  is  there  were  two  portions  in  the  register  conflict  graph 
that  could  not  be  simplified  to  interval  graphs  for  the  purpose  of  global  register  allocation.  The  number  of 
basic  blocks  in  those  portions  (23)  is  small  compared  to  the.  total  number  of  basic  blocks  in  that  program 
(195).  Further,  the  set  of  live  ranges  that  fit  into  those  portions  was  small,  so  exhaustive  search  for  that 
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Figure  7-18:  Chromatic  numbers:  warp  and  graph  programs 


subgraph  of  the  conflict  graph  is  feasible.  Interestingly,  the  chromatic  number  of  one  of  the  hot  spots 
dominated  the  chromatic  number  of  the  clique  separated  components  of  the  remaining  conflict  graph.  So 
exhaustive  search  of  a  small  component  of  the  register  conflict  graph  enabled  us  to  optimize  the  overall 
chromatic  number. 

All  except  one  conflict  graph  of  our  benchmark  functions  were  mapped  to  an  equivalent  interval  graph  by 
our  method.  The  one  non-interval  conflict  graph  contained  only  rave  small  non-interval  section,  such  that 
most  of  die  conflict  graph  could  be  mapped  to  equivalent  interval  graphs.  The  number  of  clique  separators 
found  in  each  register  conflict  graph  is  bounded  by  the  number  of  basic  blocks  that  could  be  transformed 
into  straight  line  code  for  the  purpose  of  register  allocation.  For  svdcmp,  85%  of  all  basic  blocks  and  for  all 
other  benchmark  kernels,  all  basic  blocks  could  be  transformed  to  straight  line  code,  so  at  least  85%  of  all 
clique  separators  were  located  for  svdcmp.  For  all  other  kernels,  our  method  was  able  to  locate  all  clique 
separators.  In  the  second  part  of  this  chapter  we  show  how  clique  separators  are  used  to  partition  a  register 
conflict  graph  into  independent  components. 

7 3.1.  Parallelization  based  on  clique  separators 

The  parallelization  of  global  register  allocation  is  useful  if  optimal  global  register  allocation  is  expensive. 
Given  Jfc  registers  and  a  Jfc-colorable  interval  register  conflict  graph,  an  optimal  assignment  of  registers  can 
be  found  in  linear  time.  In  that  case,  significant  parallel  speedup  for  global  register  allocation  can  not  be 
expected  because  the  overhead  of  parallel  task  management  and  re-combination  of  individual  results  would 
cancel  any  speedup  over  sequential  register  allocation. 

Optimal  global  register  allocation  is  expensive  if  the  register  conflict  graph  is  not  an  interval  graph  and  if 
the  register  conflict  graph  is  not  Jfc-colorable,  and  live  ranges  must  be  split  or  spilled  to  memory.  By  our 
analysis,  the  non-interval  graph  portion  of  practical  register  conflict  graphs  can  be  restricted  to  relatively 
small  subgraphs. 
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Arbitrary  register  conflict  graphs 

Exponential  coloring  algorithms  are  feared  for  a  reason:  suppose  we  have  an  upper  bound  of  k  colors 
needed  to  color  an  arbitrary  graph  with  15  nodes,  but  would  like  to  find  an  optimal  coloring,  possibly 
reducing  the  number  of  registers.  In  this  scenario,  exhaustive  search  requires  inspecting  k15  different 
combinations  -  for  k= 5,  the  number  of  combinations  is  several  billions.  In  reality,  exhaustive  search  need 
not  be  so  prohibitively  expensive,  because  large  quantities  of  combinations  can  be  pruned  from  the  search 
space.  It  is  very  difficult  to  make  an  accurate  prediction  of  the  time  it  takes  to  produce  an  optimal  coloring 
of  an  arbitrary  graph. 

Spilling 

If  the  chromatic  number  of  an  arbitrary  or  interval  register  conflict  graph  exceeds  the  number  of  available 
registers,  spilling  and/or  live  range  splitting  must  be  applied.  Good  spilling  is  crucial  to  the  execution  time 
of  a  program,  and  optimal  spilling  is  NP  complete  even  for  straight  line  code,  and  therefore  for  interval 
register  conflict  graphs.  Because  good  spilling  decisions  can  be  expensive,  a  parallelization  based  on 
clique  separators  is  likely  to  achieve  good  parallel  speedup.  It  is  a  disadvantage  that  the  optimality  of 
spilling  decisions  for  individual  portions  of  the  register  conflict  graph  can  not  always  be  maintained  when 
clique  connected  components  are  combined. 

It  is  difficult  to  predict  the  complexity  of  global  register  allocation  in  the  presence  of  spilling  or  non- 
interval  register  conflict  graphs.  The  reason  is  that  heuristic.'  for  good  spilling  or  good  colorings  of 
non-interval  portions  can  be  arbitrarily  expensive.  In  our  benchmark  the  chromatic  numbers  of  the  register 
conflict  graphs  range  between  8  and  59.  The  number  of  registers  in  most  current  computer  architectures  is 
insufficient  for  a  number  of  our  conflict  graphs.  Depending  on  the  spilling  heuristics  employed,  the 
running  time  of  global  register  allocation  can  be  large  for  the  register  conflict  graphs  of  our  benchmark. 

73.8.  Method  of  parallelization:  an  example 

The  parallelization  of  global  register  allocation  via  clique  separators  is  best  explained  by  example.  Figure 
7-19  depicts  an  interval  register  conflict  graph  and  its  partition  into  two  subgraphs  via  a  clique  separator. 
The  graph  labeled  "conflict  graph"  depicts  the  conflict  graph  of  the  entire  program  that  consists  of  basic 
blocks  {133,4,5,6,7,8,9,10}.  The  set  of  live  ranges  consists  of  [ajbx4^f\-  The  chromatic  number  of  the 
conflict  graph  is  3.  Because  the  program  consists  of  straight  line  code,  every  basic  block  can  be  chosen  to 
find  a  separator  clique.  The  box  drawn  around  basic  block  5  and  across  the  live  ranges  that  contain  block  5 
depicts  the  separator  clique  based  on  5.  Block  5  is  contained  by  live  ranges  {b,c4}-  Therefore,  [bx4] 
form  a  clique  that  partitions  the  conflict  graph  into  two  independent  components,  the  top  and  bottom  part 
depicted  below  the  original  conflict  graph  Both  the  top  part  and  the  bottom  part  can  be  colored 
independently,  as  long  as  the  separator  clique  is  colored  in  both  parts.  The  colorings  of  both  parts  can  be 
combined  by  adjusting  the  colors  used  for  the  live  ranges  in  the  separator  clique,  namely  {b,c4)-  The 
number  of  live  ranges  whose  coloring  must  be  adjusted  depends  on  the  number  of  live  ranges  colored  with 
the  colors  used  for  b,c  and  d. 
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Figure  7-19:  Parallelization  via  clique  separators 

Note  that  in  a  parallel  implementation  based  on  clique  separators,  the  nodes  of  each  separator  clique  must 
be  colored  twice,  which  is  not  necessary  in  a  sequential  implementation.  So  the  parallel  algorithm  does 
more  "work"  than  the  sequential  algorithm.  In  our  example,  the  size  of  the  separator  clique  is  3.  Given  two 
parallel  processors,  3  extra  coloring  steps  must  be  performed  plus  the  re-naming  during  post  processing. 
Because  the  size  of  the  separator  clique  is  equal  to  the  chromatic  number  of  the  conflict  graph,  every  live 
range  in  the  conflict  graph  must  be  re-colored,  if  none  of  the  colors  for  the  elements  of  the  separator  clique 
match.  Had  we  chosen  a  separator  clique  that  is  smaller,  both  the  redundant  work  and  the  number  of 
re-coloring  steps  would  be  smaller,  depicted  in  Figure  7-20.  The  conflict  graph  is  identical  to  that  depicted 
in  the  previous  figure,  but  this  time  the  separator  is  chosen  based  on  block  4.  The  new  separator  clique 
consists  only  of  (h),  and  given  an  indec<ndent  coloring  for  both  parts  separated  by  {£>),  at  most  one  live 
range  must  be  re-named  when  the  two  colorings  are  combined. 

This  example  demonstrates  that  the  size  of  a  separator  clique  determines  both  the  amount  of  redundant 
work  spent  in  parallel  register  allocation  and  the  post-processing  time.  In  the  nest  section  we  present 
measurements  of  separator  cliques  for  some  of  our  benchmark  kernels. 
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Figure  7-20:  Chosing  a  small  separator  clique 

7.4.  Parallelization  based  on  clique  separators:  evaluation  of  Phase  3 

Our  method  transforms  a  flow  graph  to  a  straight  line  sequence  of  basic  blocks  that  are  equivalent  for  the 
purpose  of  register  allocation.  By  Theorem  11  of  Chapter  5,  this  sequence  of  basic  blocks  can  be  used  to 
compute  an  ordered  sequence  of  cliques  in  the  conflict  graph,  each  of  which  is  also  a  clique  separator.  The 
parallelization  consists  of  partitioning  the  register  conflict  graph  via  these  clique  separators. 

In  interval  conflict  graphs,  the  processing  time  for  the  parallel  tasks  is  correlated  to  the  number  of  live 
ranges  in  each  subtask.  So  if  it  is  possible  to  divide  the  conflict  graphs  such  that  the  clique  connected 
components  contain  an  equal  number  of  nodes,  and  in  addition  the  clique  separators  at  those  points  are  not 
too  large,  parallelism  based  on  clique  separators  is  intuitively  successful  We  present  measurements  of  both 
the  separator  clique  sequences  and  the  distribution  of  live  ranges  for  some  examples  of  our  benchmark 
kernels  in  Figures  7-21,  7-22,  7-23,  7-24,  7-25,  7-26  and  7-27.  The  x-axis  depicts  the  basic  blocks  in  the 
flow  graph,  the  y-axis  depicts  the  number  of  live  ranges  as  a  percentage  of  the  total  number  of  live  ranges 
in  the  conflict  graph.  For  each  basic  block  that  is  not  marked  prohibited  by  our  method,  the  bottom  line 
depicts  the  size  of  the  separator  cliques  as  percentage  of  the  total  number  of  live  ranges,  and  the  top  line 
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depicts  the  accumulative  number  of  live  ranges.  If  for  example  the  value  of  the  bottom  line  for  basic  block 
10  is  15,  the  value  of  the  top  line  45,  then  the  size  of  the  separator  clique  is  15%  of  the  total  number  of  live 
ranges  and  45%  of  all  live  ranges  start  at  block  10  or  earlier. 

The  results  for  svdcmp  are  depicted  in  Figure  7-21.  The  total  set  of  live  ranges  is  distributed  relatively 
evenly  over  the  set  of  straight  line  basic  blocks.  At  every  basic  block,  the  size  of  the  separator  cliques  is 
only  a  small  percentage  of  the  total  number  of  live  ranges.  Between  individual  separator  cliques,  there  are 
relatively  large  differences.  Note  that  at  three  basic  blocks,  67,  92  and  124  the  size  of  the  separator  clique 
is  very  small.  These  basic  blocks  coincide  with  loop  exits  in  the  program.  Basic  blocks  for  which  the  size 
of  the  separator  clique  is  not  shown  are  the  prohibited  blocks  among  which  our  method  is  unable  to  find 
separator  cliques. 


Figure  7-21:  Separator  cliques  and  accumulated  live  ranges  for  svdcmp 

We  demonstrate  the  use  of  the  information  depicted  in  Figure  7-21  for  the  parallelization  of  global 
register  allocation  by  an  example.  Suppose  we  wish  to  parallelize  global  register  allocation  for  svdcmp 
with  3  processors,  such  that  each  processor  processes  an  equally  large  component  of  the  conflict  graph. 
The  live  ranges  must  be  partitioned  into  3  sets,  each  consisting  of  33%  of  the  conflict  graph.  The  clique 
that  separates  the  first  33%  of  the  conflict  graph  is  found  at  basic  block  60,  the  next  component  consists  of 
all  live  ranges  between  basic  blocks  60  and  130  (where  66%  of  the  live  ranges  have  been  accumulated)  and 
the  last  component  consists  of  the  live  ranges  between  basic  blocks  130  and  195.  Obviously  only  basic 
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blocks  for  which  the  separator  clique  exists  (i.e.  non-prohibited  blocks  found  in  Phase  2  of  our  algorithm) 
can  be  used  to  partition  the  live  ranges. 

Figure  7-22  shows  the  same  data  for  the  backpropagation  algorithm  of  the  warp  library.  The  line 
depicting  the  accumulated  live  ranges  rises  very  steeply  for  the  first  few  basic  blocks,  and  then  climbs 
almost  linearly.  This  is  due  to  a  relatively  large  number  of  global  variables  in  that  program.  The  size  of  the 
separator  cliques  Is  very  even  throughout  the  program,  but  compared  to  the  total  number  of  live  ranges 
noticeably  higher  than  for  svdcmp.  Again,  this  can  be  explained  with  the  number  of  global  variables  that 
remain  live  throughout  the  program,  so  they  are  part  of  every  separator  clique.  The  code  for  that  program 
is  typical  "spaghetti  code",  a  long  sequence  of  not  too  complex  programming  constructs. 
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Figure  7-22:  Separator  cliques  and  accumulated  live  ranges  for  back 

Our  next  example  is  the  frequency  domain  filtering  routine,  depicted  in  Figure  7-23.  The  size  of  the 
separator  cliques  varies  greatly  throughout  the  program,  and  remains  high  for  the  last  set  of  basic  blocks 
that  follow  block  48.  The  variations  in  the  separator  cliques  can  be  explained  with  the  structural 
complexity  of  the  program.  Note  that  around  basic  block  48,  almost  100%  of  the  live  ranges  have  been 
accumulated.  The  persistently  large  separator  cliques  for  the  same  portion  of  the  program  indicate 
computation  involving  a  large  set  of  variables.  The  frequency  of  small  separator  cliques  in  the  first  pan  of 
the  program  together  with  a  relatively  linear  increase  of  die  accumulated  live  ranges  indicate  that  the 
conflict  graph  can  be  parallelized  successfully. 
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Figure  7-23:  Separator  cliques  and  accumulated  live  ranges  for  filtering 

The  filtering  routine  is  inlined  in  the  select  kernel  shown  in  Figure  7-24.  This  is  reflected  in  the  shape  of 
both  the  accumulated  live  range  curve  and  the  separator  clique  curve:  between  basic  blocks  20  and  80,  the 
shapes  of  both  curves  are  almost  identical  to  the  curves  of  Figure  7-23.  Note  that  overall  the  size  of  the 
separator  cliques  is  much  larger  in  relation  to  the  total  number  of  live  ranges  than  for  the  frequency  domain 
filtering  routine  in  isolation.  This  is  due  to  variables  local  to  select,  that  remain  live  throughout  the  inlined 
code  for  frequency  domain  filtering. 

Our  next  example  is  the  jacobi  kernel  in  Figure  7-25.  The  size  of  the  separator  cliques  is  very  irregular, 
again  caused  by  the  structural  complexity  of  the  program.  In  the  second  half  of  the  flow  graph,  starting  at 
basic  block  31,  die  size  of  the  separator  clique  is  very  large  in  relation  to  the  total  number  of  live  ranges. 
Small  cliques  occur  only  at  the  beginning  of  the  program.  This  example  shows  the  tradeoff  between 
finding  parallel  tasks  of  equal  size  and  the  size  of  the  separator  cliques.  If  the  size  of  the  separator  cliques 
is  the  main  factor  for  the  parallelization,  the  size  of  the  parallel  tasks  must  be  very  uneven  -  on  the  other 
hand  equally  sized  parallel  tasks  can  only  be  obtained  by  partitioning  via  large  separator  cliques. 

As  our  last  two  programs  we  picked  the  two  most  complex  Livermore  kernels,  Kernel  15  and  Kernel  22. 
The  data  for  Kernel  15  is  shown  in  Figure  7-26,  and  the  structure  of  the  code  is  reflected  in  the  size  of  the 
separator  cliques:  the  kernel  consists  of  two  separate  loops,  each  of  which  performs  computations  on 
variables  local  to  each  loop.  The  size  of  the  separator  cliques  is  large  in  the  basic  blocks  that  occur  inside 
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Figure  7-24:  Separator  cliques  and  accumulated  live  ranges  for  select 
the  loops,  and  the  separator  clique  of  basic  block  49  that  partitions  the  first  loop  from  the  second  is 
extremely  small.  This  indicates  that  the  register  conflict  graph  can  be  partitioned  into  two  almost 
completely  independent  parts,  each  of  which  contains  about  50%  of  the  live  ranges. 

The  data  for  Livermore  Kernel  22  is  depicted  in  Figure  7-27.  The  shapes  of  both  the  accumulated  live 
ranges  curve  and  the  separator  clique  curve  are  typical  for  single  loops:  almost  all  live  ranges  are 
accumulated  within  the  first  few  basic  blocks,  and  almost  all  variables  live  throughout  the  loop,  indicated 
by  the  large  size  of  the  separator  cliques  for  the  basic  blocks  that  form  the  loop  body,  blocks  5  through  53. 
Because  the  bulk  of  live  ranges  start  in  the  same  basic  block  (note  the  steep  ascend  of  the  accumulated  live 
range  curve  at  basic  block  5),  it  is  not  possible  to  partition  the  conflict  graph  into  two  subgraphs  that 
contain  equally  many  nodes  -  in  addition  to  very  large  separator  cliques.  Hence,  the  parallelization  of 
global  register  allocation  for  such  graphs  is  useless. 

7.4.1.  Tradeoffs  between  parallelization  strategies 

The  conflict  graph  of  our  single  value  decomposition  program  (Figure  7-21)  has  properties  that  make  it 
easy  to  partition.  Fust,  the  accumulated  live  range  curve  rises  almost  linearly  with  increasing  basic  block 
numbers.  This  means  that  it  is  easy  to  partition  the  register  conflict  graph  into  chunks  of  even  size.  The 
size  of  the  separator  cliques  is  fairly  even,  and,  more  importantly,  a  relatively  small  fraction  (about  10%)  of 
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Figure  7-25:  Separator  cliques  and  accumulated  live  ranges  for  jacobi 
the  total  number  of  live  ranges.  Using  iO  processors  to  process  this  conflict  graph  in  parallel  means  that 
the  size  of  the  separator  clique  is  equal  to  me  number  of  nodes  in  the  partial  conflict  graphs;  hence  the 
number  of  processors  that  can  be  used  effectively  is  smaller  than  10. 

Even  though  the  conflict  graph  of  our  backpropagation  program  has  similar  characteristics,  the  size  of  the 
separator  cliques  is  much  larger  in  relation  to  the  total  number  of  live  ranges  (ca.  25%).  Distributing  global 
register  allocation  for  this  kernel  across  4  processors  means  that  for  each  component  of  the  conflict  graph 
the  size  of  the  separator  clique  is  equal  to  the  number  of  nodes  -  so  only  fewer  than  4  processors  can  be 
used  effectively  in  parallel.  The  large  size  of  the  separator  cliques  is  a  consequence  of  a  large  number  of 
global  variables  that  remain  live  throughout  the  program. 

Our  frequency  domain  filtering  program  produces  a  conflict  graph  in  which  the  varying  sizes  of  the 
separator  cliques  can  not  be  ignored  during  the  pardoning.  About  40%  of  the  separator  cliques  are  very 
small,  whereas  the  other  60%  of  the  separator  cliques  are  very  large  compared  to  the  total  number  of  nodes 
in  the  conflict  graph.  A  simple  partition  based  merely  on  graph  size  might  not  yield  good  speedup.  This 
phenomenon  is  tai  n  to  the  extreme  in  the  example  depicted  in  Figure  7-26  -  only  a  few  clique  separators 
are  small  enough  to  make  parallelization  of  global  register  allocation  worthwhile.  Only  two  clique 
separators  (at  basic  blocks  49,50)  partition  the  conflict  graph  into  large  enough  chunks  so  that  the  maximal 
task  size  becomes  sufficiently  small. 
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Figure  7-26:  Separator  cliques  and  accumulated  live  ranges  for  Livermore  Kernel  15 

To  summarize  our  observations,  good  parallel  speedup  can  only  be  expected  when  the  size  of  the 
separator  clique  is  significantly  smaller  than  the  number  of  nodes  in  the  partial  conflict  graphs.  Our 
measurements  indicate  that  for  small  kernels  this  is  hardly  achieved.  Our  single  value  decomposition 
kemei  is  one  of  the  longest  kernels  in  our  testsuite,  and  the  number  of  global  variables  is  small.  The 
relation  of  the  separator  clique  size  to  the  total  number  of  live  ranges  is  favorable,  and  we  conclude  that  for 
longer  programs,  more  parallelism  can  be  found  in  global  register  allocation. 

While  procedure  inlining  increases  the  size  of  individual  functions,  our  experience  is  that  the  size  of  the 
separator  clique  grows  with  the  context  of  the  inlining  functions:  live  ranges  of  the  inlining  functions 
co-exist  with  the  live  ranges  of  the  inlined  function,  thereby  increasing  the  size  of  the  separator  clique.  One 
possibility  to  overcome  this  problem  is  to  reduce  the  size  of  the  separator  clique  by  splitting  live  ranges  of 
the  inlining  procedure. 
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Figure  7-27:  Separator  cliques  and  accumulated  live  ranges  for  Livermore  Kernel  22 


7  J.  Related  work 

Global  register  allocation  is  based  on  the  conflict  graph  that  denotes  the  register  conflicts  of  the  entire 
program.  In  some  implementations  of  global  register  allocation  via  graph  coloring,  the  entire  register 
conflict  graph  is  built  before  the  coloring  process  starts.  By  assigning  priorities  to  variables  based  on 
where  in  the  program  they  occur,  the  program  structure  implicitly  guides  the  order  in  which  variables  are 
assigned  to  registers.  Different  implementations  differ  mainly  in  the  heuristics  that  are  used  to  decide 
which  variables  to  spill  to  memory  if  there  are  not  enough  available  registers  [ChowHen  90,  Briggs  et  al 
89,  Bernstein  et  al  89]. 

Other  approaches  to  global  register  allocation  use  program  structure  more  exolicitly.  In  the  research 
reported  in  [GupSofSte  89],  the  register  conflict  graph  is  built  and  colored  separately  for  each  path  through 
the  flow  graph.  The  goal  is  to  build  and  color  a  register  conflict  graph  incrementally,  motivated  by  space 
problems  when  the  entire  conflict  graph  is  kept  in  memory.  Since  only  one  straight  line  path  through  the 
flow  graph  is  considered  at  a  time,  the  register  conflict  graph  is  assumed  to  be  an  interval  graph  and  hence 
clique  separators  are  used  for  data  partitioning.  When  the  colorings  if  conflict  graphs  of  individual  paths 
are  combined,  move  instructions  are  used  to  adjust  colorings.  We  see  two  problems  with  this  approach: 
First,  even  though  straight  line  paths  are  considered  individually,  the  presence  of  broken  live  ranges  does 
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not  ensure  that  the  conflict  graphs  for  these  paths  are  really  interval  graphs.  As  a  result,  a  clique  found  in 
the  register  conflict  graph  of  one  path  might  not  even  be  a  sepa-ator  clique  lor  the  conflict  graph  of  one 
individual  path  through  the  flow  graph.  Second,  when  cse  colorings  of  ihe  paths  through  a  cord'tional  a 
merged,  there  is  no  mechanism  to  enforce  that  B GLOBAL  or  LOEX  live  ranges  arc  colored  identic?')  and 
colorings  must  be  adjusted  by  introducing  unnecessary  spill  and  move  instructions. 

A  strictly  hierarchical  approach  to  global  register  allocation  via  graph  coloring  is  repor  eri 

[CallahanKoblenz  91]:  register  conflict  graphs  are  built  and  colored  based  on  the  parse  Bee  of  the  pjvrnm. 
The  motivation  is  to  concentrate  on  the  register  conflict  graphs  of  critical  potions  of  the  input  program,  to. 
example  innermost  loops.  An  overall  coloring  is  derived  by  combining  individual  colorings;  coloring;  that 
do  not  match  are  adjusted  by  spill  and  move  instructions  that  can  be  avoided  with  the  method  presented  in 
this  thesis.  The  trace  scheduling  compiler  described  in  [FreuRut  r)]  uses  a  similar  approach,  though 
register  allocation  is  performed  hand  in  hand  with  the  code  scheduler. 

Common  to  all  approaches  is  their  heuristic  nature.  If  parts  of  the  register  conflict  graph  are  colored 
separately,  the  overall  coloring  is  derived  by  inserting  cleanup  operations  like  move  instructions  if  the  two 
individually  colored  pieces  do  not  fit  together  Even  if  individual  pieces  can  be  colored  optimal’y 
optimality  of  the  overall  coloring  is  in  general  not  preserved  when  colorings  are  combined. 

7.5.1.  Chapter  summary 

Our  method  was  very  successful  for  our  benchmark  kernels.  Every  conditional  could  be  r,<v,<sformed  into 
equivalent  straight  line  code,  in  most  cases  without  node  merging.  Broken  live  ranges  occurred  in  several 
programs,  and  the  number  of  broken  live  ranges  was  related  to  the  structural  complexity  of  the  programs.  In 
all  but  two  cases  (both  in  the  same  program),  holes  could  be  eliminated.  Consequently,  all  except  ow 
kernel  could  be  n  apped  to  a  straight  line  flow  graph  that  was  equivalent  for  the  purpose  of  global  register 
allocation,  and  th  resulting  conflict  graph  was  an  interval  graph.  Only  for  one  kernel  an  interval  register 
conflict  graph  could  not  be  found.  For  that  kernel,  we  identified  a  small  subset  of  nodes  in  the  register 
conflict  graph  that  induced  non-interval  portions  in  the  conflict  graph. 

Because  it  is  difficult  to  predict  the  sequential  execution  time  for  a  non-interval  graph  or  for  a  graph  in 
which  nodes  must  be  spilled  to  memory,  the  parallel  speedup  can  not  be  assessed  accurately.  Our 
measurements  of  separator  clique  sizes  demonstrates  that  our  method  finds  enough  clique  separators  such 
that  a  variety  of  strategies  can  be  applied  to  parallelize  global  register  allocation. 
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Chapter  8 
Conclusions 


Program  structure  plays  an  important  role  for  the  parallelization  of  global  compiler  optimizations.  We 
distinguish  between  structured  and  unstructured  compiler  optimizations.  In  structured  compiler 
optimizations,  data  locality  is  given  by  explicit  program  structure:  results  for  non-nested  programming 
constructs  can  be  combined  to  an  overall  result  without  backtracking.  In  unstructured  compiler 
optimizations,  there  is  no  easy  way  to  oartition  the  data  into  independent  components.  Data  partitioning  is 
based  on  characteristics  of  the  underlying  data  structure,  and  program  structure  is  used  implicitly  to  deduce 
those  characteristics. 

In  this  thesis,  we  developed  parallel  frameworks  for  interval  analysis  and  global  register  allocation, 
representing  a  structured  and  an  unstructured  optimization  respectively.  The  thesis  demonstrates  that  for 
both  types  of  compiler  optimizations,  program  structure  plays  a  key  role  in  finding  a  data  partitioning. 

8.1.  Parallel  interval  analysis 

In  interval  analysis,  the  basic  blocks  of  a  program  flow  graph  are  processed  in  an  order  that  is  given  by 
the  loop  structure  of  the  input  program.  This  leads  to  a  natural  partitioning  of  the  data  for  parallelization: 
the  data  flow  information  of  loops  that  are  not  nested  within  each  other  can  be  computed  independently  and 
combined  to  form  an  overall  result  without  backtracking.  Our  parallelization  of  interval  analysis  is  based 
on  a  tree  called  the  complete  interval  tree  that  captures  the  loop  structure  of  the  input  program.  Each  node 
in  that  tree  is  a  task  in  the  parallel  implementation  of  interval  analysis,  and  all  nodes  in  that  tree  that  are  not 
descendants  or  predecessors  of  each  other  can  be  processed  in  parallel. 

The  goal  of  our  implementation  of  parallel  interval  analysis  was  to  assess  the  number  of  parallel 
processors  that  can  operate  effectively  for  a  given  complete  interval  tree.  When  the  execution  time  for  each 
parallel  task  is  known  beforehand,  the  minimal  time  to  process  the  complete  interval  tree  is  equal  to  the 
length  of  the  longest  path  from  the  root  of  the  tree  to  a  leaf  node.  Given  the  minimal  processing  time  for  a 
complete  interval  tree  T,  we  developed  a  lower  bound  for  the  minimal  number  of  processors  needed  to 
process  T  in  minimal  time  (finding  the  minimal  number  of  processors  is  NP  complete). 

Our  parallelization  of  interval  analysis  is  extremely  simple,  and  one  drawback  is  that  the  data  partitioning 
depends  directly  on  the  loop  structure  of  the  input  program.  Hence,  the  size  of  the  parallel  tasks  is 
pre-determined  by  the  loop  structure,  and  there  is  no  flexibility  when  the  parallel  tasks  are  of  uneven  sizes. 
In  some  cases  this  can  lead  to  the  lack  of  parallelism  when  a  complete  interval  tree  is  processed 
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Our  implementation  shows  that  in  practice  potential  parallel  speedup  decreases  when  the  number  of 
processors  used  is  too  large;  in  all  cases  the  number  of  processors  that  could  be  used  effectively  in  practice 
was  significantly  smaller  than  the  minimal  number  of  processors  needed  to  process  the  complete  interval 
tree  in  minimal  time.  In  other  words,  due  to  system  overhead  the  theoretically  optimal  speedup  could  not 
be  achieved.  We  conclude  that  finer  grained  parallelism  in  which  more  processors  can  be  used  in  parallel 
does  not  result  in  better  observed  speedup,  and  that  our  simple  approach  to  the  parallelization  of  global  data 
flow  analysis  works  well  in  practice. 


8.2.  Global  register  allocation 

While  for  parallel  interval  analysis  the  division  of  the  problem  in  parallel  tasks  is  explicitly  given  by  the 
loop  structure  of  the  input  program,  the  role  of  program  structure  in  the  parallelization  of  global  register 
allocation  is  very  different  In  global  register  allocation,  program  structure  is  used  to  guide  the  analysis  of 
the  register  conflict  graph,  which  must  be  partitioned  for  parallel  processing.  Goal  of  this  analysis  is  to 
locate  clique  separators  in  the  conflict  graph,  and  we  achieve  this  by  determining  portions  of  the  register 
conflict  grapn  that  are  interval  graphs. 

We  used  the  fact  that  register  conflict  graphs  for  straight  line  code  are  interval  graphs,  so  in  a  well 
structured  program  the  only  source  for  non-interval  portions  of  the  register  conflict  graph  are  conditionals 
and  loops.  When  certain  restrictions  are  met  by  the  live  ranges  that  occur  in  conditionals  or  loops,  we  can 
infer  that  the  corresponding  portion  of  the  register  conflict  graph  is  an  interval  graph.  Even  when  those 
restrictions  are  not  met  by  the  live  ranges  of  a  loop  or  a  conditional,  there  are  situations  in  which  we  can 
create  an  interval  register  conflict  graph  via  node  merging. 

To  characterize  live  ranges  that  are  responsible  for  non-interval  register  conflict  graphs,  we  developed  a 
structured  model  for  global  register  allocation  in  which  knowledge  about  program  structure  is  encoded  in 
the  live  ranges  of  the  input  program.  Given  this  knowledge  in  a  register  conflict  graph,  the  conflict  graph  is 
analyzed  in  a  structured  way,  one  loop  or  conditional  at  a  time. 

Our  model  provides  a  basis  that  permits  us  to  improve  register  allocation,  both  for  sequential  and  parallel 
implementations.  The  advantages  of  our  method  for  sequential  register  allocation  are  as  follows: 

1.  Interval  portions  of  the  register  conflict  graph  are  detected  and  can  be  colored  optimally  in 
linear  time. 

2.  Clique  separators  can  be  found  systematically  in  interval  potions  of  ‘he  register  conflict 
graph.  Clique  separated  components  of  the  register  conflict  graph  can  be  colored 
independently  and  combined  to  an  overall  coloring  in  linear  time  without  compromising  the 
quality  of  the  overall  solution.  Being  able  to  partition  a  register  conflict  graph  into  small 
components  means  that  the  overall  running  time  can  be  reduced. 

3.  All  cliques  in  interval  graphs  can  be  enumerated.  The  size  of  the  cliques  in  interval  portions 
of  register  conflict  graphs  is  known.  Only  live  ranges  that  are  members  of  cliques  that  are 
larger  than  the  number  of  registers  must  be  considered  for  spilling.  In  most  cases,  live  ranges 
that  must  be  considered  for  spilling  are  only  a  subset  of  all  live  ranges  of  the  input  program, 
and  this  subset  is  known  before  the  coloring  process.  This  is  a  definite  advantage  over 
standard  methods  for  spilling  in  which  all  live  ranges  that  are  still  part  of  the  register  conflict 
graph  must  be  considered  for  spilling  indiscriminately. 
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4.  In  general  finding  an  optimal  coloring  for  a  non-interval  portion  of  the  register  conflict  graph 
is  NP  complete.  Our  measurements  indicate  that  such  portions  usually  consist  of  onl>  small 
parts  of  a  register  conflict  graph,  so  using  expensive  heuristics  or  even  exhaustive  search  for 
the  coloring  such  portions  need  not  be  prohibitively  expensive.  Non-interval  portions  of  the 
register  conflict  graph  are  clique  separated  from  the  interval  portions,  so  good  (or  optimal) 
colorings  for  non-interval  portions  a  oe  incorporated  into  an  overall  coloring  in  linear  time. 

5.  The  knowledge  about  clique  separators  is  useful  for  hierarchical  register  allocation:  crucial 
portions  of  the  register  conflict  graph  (example:  portions  that  correspond  to  innermost  loops 
of  the  flow  graph)  can  be  colored  separately  and  combined  with  the  overall  coloring  in  a 
systematic  and  structured  way. 

8.2.1.  Evaluation  of  the  mode! 

Goal  of  our  implementation  was  to  evaluate  how  often  restrictions  that  enable  us  to  detect  or  create 
interval  register  conflict  graphs  are  met  by  the  live  ranges  of  realistic  programs.  Our  model  is  sufficient  for 
the  analysis  of  register  conflict  graphs  when  we  are  able  to  detect  large  interval  portions  and  many  clique 
separators.  Our  measurements  show  that  for  most  input  programs  our  method  is  able  to  detect  interval 
register  conflict  graphs  and  therefore  a  large  number  of  clique  separators. 

8.2.2  Parallel  global  register  allocation 

Given  the  knowledge  about  the  location  of  clique  separators  in  a  register  conflict  graph,  the 
parallelization  of  global  register  allocation  is  straightforward.  A  partitioning  of  a  register  conflict  graph 
that  allows  good  parallel  speedup  must  have  the  following  characteristics: 

•  The  size  of  the  graph  portions  that  are  processed  in  parallel  must  be  equal. 

•  The  size  of  the  separator  cliques  must  be  small  to  decrease  the  re-combination  cost. 

Our  results  indicate  that  for  small  kernels,  the  size  of  the  separator  cliques  is  so  large  that  only  very  few 
(2-4)  processors  could  be  used  efficiently  in  parallel.  For  larger  kernels  the  size  of  the  separator  cliques 
tended  to  be  small  in  relation  to  the  total  number  of  nodes  in  the  conflict  graphs  and  therefore  the  number 
of  processors  that  i  be  used  in  parallel  effectively  is  a  little  larger.  The  abundance  of  separator  cliques 
allows  a  variety  of  strategies  for  the  parallelization. 

8.3.  Future  work 

We  examined  the  correlation  between  program  structure  and  global  compiler  optimizations  under  two 
different  aspects.  First,  the  explicit  use  of  program  structure  for  the  parallelization  of  compiler 
optimizations  that  operate  based  on  a  program’s  loop  structure.  The  effectiveness  of  this  approach  was 
shown  for  interval  analysis,  and  the  implementation  of  parallel  interval  analysis  of  only  one  data  flow 
equation  showed  potential  for  speedup  in  practice.  Second,  the  implicit  use  of  program  structure  for  the 
analysis  of  data  structures  processed  in  less  structured  optimizations  which  leads  to  a  subsequent 
parallelization.  By  developing  a  structural  model  for  global  register  allocation  we  showed  that  this  concept 
works  well. 
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8.3.1.  Explicit  use  of  program  structure:  extensions 

Our  work  in  parallel  interval  analysis  can  be  deepened  in  two  aspects.  First,  it  is  important  to  examine 
the  correlation  between  program  structure  and  the  best  possible  speedup  in  detail.  Ideally  this  work  should 
lead  to  a  simple  model  that  permits  us  to  predict  the  expected  parallel  speedup  for  a  given  input  program. 
This  is  particularly  important  for  practical  implementations  of  parallel  compilers:  the  automatic 
parallelization  of  such  optimizations  is  only  successful  when  the  amount  of  parallelism  and  therefore  the 
number  of  processors  that  can  operate  effectively  in  parallel  can  be  predicted  accurately  and  quickly. 

Second,  it  would  be  interesting  to  evaluate  the  approach  for  optimizations  that  fit  into  the  same 
framework  as  parallel  interval  analysis  but  are  more  complex.  Examples  are  loop  vectorization  and 
software  pipelining:  non-nested  loops  can  be  treated  in  parallel,  and  the  amount  of  work  that  must  be 
performed  in  each  parallel  task  is  larger  than  the  amount  of  work  required  to  perform  bit  vector 
intersections  in  data  flow  analysis. 

8.3 2.  Global  register  allocation:  future  directions 

We  showed  that  our  concept  of  using  program  structure  for  the  parallelization  of  unstructured 
optimizations  works  well  for  global  register  allocation  but  there  are  few  optimizations  that  fit  into  the 
specific  framework  of  graph  coloring.  Our  structured  model  for  global  register  allocation  serves  as  a  basis 
for  the  improvement  of  spilling  and  splitting  heuristics  for  global  registe  .  allocation.  This  aspect  of  our 
model  can  only  be  evaluated  when  it  is  incorporated  into  the  backend  of  a  compiler. 

Because  of  the  importance  of  inter  procedural  optimizations,  it  would  be  interesting  to  extend  the  model 
to  handle  inter  procedural  register  allocation.  Like  for  loops  and  conditionals,  this  involves  the 
development  of  boundary  conditions  for  procedure  calls.  Further,  the  model  can  be  adapted  to  handle 
register  hierarchies  and  register  conventions.  One  example  of  a  register  convention  is  the  use  of  specific 
registers  for  parameters. 

In  general  we  believe  that  for  well  structured  programs,  compiler  optimizations  that  are  reduced  to  NP 
complete  problems  can  be  improved  when  knowledge  about  program  structure  is  used  when  that  NP 
complete  problem  is  solved.  For  global  register  allocation  the  thesis  demonstrated  that  this  is  possible. 
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